Comparing JSON using Git

This was a weekend experiment in comparing json files that are the same in meaning but different in form.

tl;dr: Use the smudge & clean filters from git attributes to normalize json when staging (git add), so that equivalent files compare as equal. If you want to cheat and just start comparing json, you can use this repo, which sets up this behavior when you run config.sh.

Two test json files

The following two valid json documents are semantically the same, but are written differently.

a0.json

{
  "a" : {},
  "b" : {}
}

a1.json

{"b" : {}, "a" : {}}

We want to take two things into account here: one is formatting, and the other is the order of the properties in json objects.

Current status of versioning json files in git

Normally, if we try to do this…

mkdir json_test
cd json_test
git init
git add a.json
git commit -m "a0"
#
# we change the contents to a1.json
#
git diff

We get…

diff --git a/a.json b/a.json
index aaf4bfe..d5d2f69 100644
--- a/a.json
+++ b/a.json
@@ -1,4 +1 @@
-{
-  "a" : {},
-  "b" : {}
-}
+{"b" : {}, "a" : {}}

This is working as intended, because git diff compares files line by line (its default algorithm is Myers). To get closer to comparing the contents, one strategy is to convert the json into a form that fits this line-based diff behavior.
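The strategy can be sketched in plain Python, using difflib as a stand-in for a line-based diff (it is not git's own diff implementation). The helper names normalize and line_diff are made up for this illustration.

```python
import difflib
import json

# The two test files from above, as text.
a0 = '{\n  "a" : {},\n  "b" : {}\n}\n'
a1 = '{"b" : {}, "a" : {}}\n'


def normalize(text):
    """Parse json text and re-emit it with sorted keys and fixed indentation."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2) + "\n"


def line_diff(old, new):
    """Return the unified-diff lines between two texts, compared line by line."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))


raw_diff = line_diff(a0, a1)
normalized_diff = line_diff(normalize(a0), normalize(a1))

print(len(raw_diff) > 0)  # True: the raw files differ line by line
print(normalized_diff)    # []: the normalized files do not differ
```

Once both versions go through the same normalization, a line-based comparison only reports changes in the data itself.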

Filters through Git Attributes1

I thought this kind of task was for git hooks, which act like middleware that lets you intercept a step and do things in the process. But here we are talking about pre-staging, not pre-commit.

This may potentially be a feature request, because if you know that hooks exist, it feels natural to expect a pre-stage hook.

Looks like you can either:

  1. Treat json files as binary data, and write a filter that specifies how to make them comparable
  2. Use smudge and clean filters to modify the *.json files before staging.

1. Treat JSON files as binary data, custom pre-diff behavior.

The situation illustrated in the git book [chacon2014pro] is similar: it takes an Xcode configuration file that is basically json and treats it as binary. The claim is that even though it is plain text (UTF-8) and readable, this .pbxproj file is primarily consumed by a machine and hence should be treated as binary. JSON files lie in this grey area of consumption, since they need to be easily digested by both humans and machines. It is an interesting crossing point of human and machine interaction, because in a way it is the meeting point of the two kinds.

We create a .gitattributes file and designate json files as binary. We will no longer see the line-by-line output above; instead, git just indicates that the blobs differ.

*.json binary
git diff

will output

diff --git a/.gitattributes b/.gitattributes
index 12c8aef..568262a 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -1 +1 @@
-.json binary
+*.json binary
diff --git a/a.json b/a.json
index aaf4bfe..d5d2f69 100644
Binary files a/a.json and b/a.json differ

Then the book goes on ranting about Microsoft Word, and how it’s the

(Everyone knows that Word is the) most horrific editor around, but oddly, everyone still uses it. – Pro Git

.docx documents are binary, so the book demonstrates the idea by using docx2txt to convert them into comparable text. Neat! So, now we want to specify how to make our two json files diffable.

Let’s experiment with a simple and dumb python script to do so. Our simple example requires that the two json documents presented above are treated as the same.

import json

a0 = """{
  "a": {},
  "b": {}
}"""

a1 = '{"b" : {}, "a" : {}}'

a0 = json.loads(a0)
a1 = json.loads(a1)

dump0 = json.dumps(a0, sort_keys=True, indent=2)
dump1 = json.dumps(a1, sort_keys=True, indent=2)

print(dump0 == dump1)

This will print:

True

So really all we have to do is sort the keys, prettify the result, and dump it to stdout.

json_std_conv.py

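#!/usr/bin/python

import sys
import json

# Path of the file to convert; git passes it as the first argument to textconv.
file_name = sys.argv[1]

# Read and parse the json; the with-block closes the file even if parsing fails.
with open(file_name, 'r') as f:
    data = json.load(f)

# Emit a canonical form: sorted keys, two-space indentation.
print(json.dumps(data, sort_keys=True, indent=2))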

Don’t forget to add exec permissions.

chmod u+x json_std_conv.py

Then we need to change two things: the .gitattributes file and the git config.

*.json binary
*.json diff=json

This says to use the diff driver named json when running diff on *.json files. It would be straightforward if we could just point to the script here, but we need one more level of indirection through git config.

git config diff.json.textconv ./json_std_conv.py

This will add a config for the local repository, and will not affect the global or user config.

Now if you change a json file to the same data written differently, git diff won’t produce any output.

Note that you can still add and stage the file; this method only gives a sensible json output when performing diff.

2. Use Smudge and Clean

The first method didn’t modify the actual contents of the file; it just helps to provide a better output when running git diff on json files. This might be all you need, for example because you want to version json files in a more compact form. Internally, git uses Deflate [Coglan2019] to compress raw files, so git-wise, storage is less of a problem than we might think. A possible situation is that the json you are about to POST is close to the data limit and you really want to strip the whitespace. Personally, I don’t find the first method that useful; rather, it causes misunderstanding because of the inconsistency between the git diff output and what git status tells you.

But what if we want to change the contents of the file and have a consistent format that we can compare throughout the project? For this, we need to use smudging and cleaning.

Figure 1: smudge (taken from the Progit book)

Figure 2: clean (taken from the Progit book)

As we can see from the images, smudge and clean point in opposite directions when moving contents between the working directory (the things you actually see) and the staging area (the contents that you want to version). Simply put, smudge changes the file contents when you run git checkout, and clean changes them when you run git add. The direction we are interested in is clean: we want to produce a consistent json file each time we run git add. For smudge, we don’t want to change anything.
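To make the two directions concrete, here is a minimal sketch that models both filters as plain Python functions. The names clean and smudge mirror the git terms, but the functions are only an illustration of the idea, not how git invokes the filters.

```python
import json


def clean(working_text):
    """Working directory -> staging area: rewrite json in a canonical form."""
    return json.dumps(json.loads(working_text), sort_keys=True, indent=2) + "\n"


def smudge(staged_text):
    """Staging area -> working directory: pass the content through unchanged."""
    return staged_text


a0 = '{\n  "a" : {},\n  "b" : {}\n}\n'
a1 = '{"b" : {}, "a" : {}}\n'

staged = clean(a0)
print(clean(a1) == staged)              # True: both spellings stage the same content
print(clean(smudge(staged)) == staged)  # True: checkout followed by add changes nothing
```

The second check matters in practice: if cleaning already-clean content changed it again, files would keep showing up as modified.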

We need to slightly change our converter to read from stdin. We want the whole buffer, so python3’s built-in input() function, which reads a single line, won’t work.

#!/usr/bin/python3

import sys
import json

inp = sys.stdin

data = json.loads(inp.read())

print(json.dumps(data, sort_keys=True, indent=2))
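The reason for reading the whole buffer can be sketched with an in-memory stream. Here fake_stdin is a hypothetical stand-in for sys.stdin, and readline() approximates what input() would consume.

```python
import io
import json

# Stand-in for sys.stdin, holding a multi-line json file.
fake_stdin = io.StringIO('{\n  "a" : {},\n  "b" : {}\n}\n')

# Reading one line only yields "{", which is not valid json on its own.
first_line = fake_stdin.readline()
try:
    json.loads(first_line)
    first_line_is_valid = True
except json.JSONDecodeError:
    first_line_is_valid = False

# Reading the whole buffer yields the complete document.
fake_stdin.seek(0)
data = json.loads(fake_stdin.read())

print(first_line_is_valid)  # False
print(sorted(data))         # ['a', 'b']
```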

Additionally, similar to the first method, we need to modify both git config and attributes.

git config filter.consistent.clean ./json_std_conv.py
git config filter.consistent.smudge cat

Since we don’t want to change anything when we run git checkout, we use the plain cat command. The internet tells me that for trivial json formatting and printing, python ships a tool that you can run from the command line 2, but this json conversion script might go beyond just printing and re-ordering, so for this post I’m sticking with the script.

…and for the .gitattributes file, we put

*.json filter=consistent

Now when we try to git add the same json data (but with different file content), nothing will be staged, and git status will show that there is nothing to commit.
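One way to see why nothing is staged: git identifies file contents by a hash of the bytes, so two files that clean to the same bytes get the same object id. The sketch below assumes a repository using the default SHA-1 object format; clean and blob_id are illustrative helpers, not git commands.

```python
import hashlib
import json


def clean(text):
    """Canonical json, as the clean filter above would emit it."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2) + "\n"


def blob_id(content):
    """SHA-1 over a 'blob <size>' header, a NUL byte, and the content bytes."""
    data = content.encode("utf-8")
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


a0 = '{\n  "a" : {},\n  "b" : {}\n}\n'
a1 = '{"b" : {}, "a" : {}}\n'

print(blob_id(a0) == blob_id(a1))                # False: the raw files are different blobs
print(blob_id(clean(a0)) == blob_id(clean(a1)))  # True: the cleaned files are the same blob
```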

Wrapping up

We explored Git’s features for a better comparison scheme for json data, using git attributes and config. With this technique, we can compare json objects that differ in formatting or property order between versions. The git book specifies two ways to handle this, but the second method better suits my goal, which is comparing geojson across history.

This hasn’t been tested against different json inputs yet, and the behavior for array elements is left unexplored. In addition, it might be neat if we could produce a json-patch (RFC6902) [cao2016json] file from the diff results.
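As a starting point for the array question, here is a small check of what Python's json module does with the conversion used in this post. The normalize helper is a hypothetical name for that conversion; the inputs are made up for the illustration.

```python
import json


def normalize(text):
    """Same conversion as the script: sorted keys, two-space indentation."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2)


# Object keys are reordered, so key order does not matter.
obj_same = normalize('{"b": 1, "a": 2}') == normalize('{"a": 2, "b": 1}')

# Array elements keep their order, so reordered arrays still differ.
arr_same = normalize('{"xs": [1, 2]}') == normalize('{"xs": [2, 1]}')

# Objects nested inside arrays still get their keys sorted.
nested_same = normalize('[{"b": 1, "a": 2}]') == normalize('[{"a": 2, "b": 1}]')

print(obj_same, arr_same, nested_same)  # True False True
```

This matches json semantics, where arrays are ordered. Whether two arrays with the same elements in a different order should count as equal depends on the data, so it would need a separate, explicit rule.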

The same technique could be used for different markup languages, like xml (html) and other ones like yaml and toml, as long as there is a clear sorting protocol.3


Bibliography

[chacon2014pro] Chacon & Straub, Pro git, Apress (2014).

[Coglan2019] James Coglan, Building Git, James Coglan (2019).

[cao2016json] Cao, Falleri, Blanc & Zhang, JSON Patch for Turning a Pull REST API into a Push, in: International Conference on Service-Oriented Computing, 435-449 (2016).


  1. https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes ↩︎

  2. We can even use jq for this, but then you need to install stuff, which makes it less universal than assuming you have a python3 interpreter in your path. ↩︎

  3. Git sorts the files and directories in one node (directory) using the relative path from the repo root. This guarantees that the sha-1 hash is consistent. ↩︎