Git uses delta compression to store objects that are similar to each other.
Is this algorithm standardized and used in other tools as well? Is there documentation describing the format? Is it compatible with xdelta/VCDIFF/RFC 3284?
Git delta encoding is copy/insert based.
This means that the derived file is encoded as a sequence of opcodes, each of which represents either a copy instruction (e.g. copy y bytes from the base file, starting at offset x, into the target buffer) or an insert instruction (e.g. insert the next x bytes into the target buffer).
As a very simple example (taken from the paper 'File System Support for Delta Compression'), consider that we want to create a delta buffer that transforms the text "proxy cache" into "cache proxy". The resulting instructions should be:

Copy 5 bytes from offset 6 (the word "cache")
Insert one byte: a space
Copy 5 bytes from offset 0 (the word "proxy")
Which translated to git's encoding becomes (a copy opcode has its most significant bit set, with the low bits flagging which offset and size bytes follow; an insert opcode is just the count of the literal bytes that follow it):

0x91, 0x06, 0x05 (bytes 1-3 represent the first instruction: one offset byte, one size byte)
0x01, 0x20 (bytes 4-5 represent the second instruction: one literal byte, the space)
0x90, 0x05 (bytes 6-7 represent the last instruction: no offset byte, one size byte)
Notice that the last copy instruction does not specify an offset, which means offset 0. Other bits in the copy opcode can also be set when bigger offsets/lengths are needed.
The resulting delta buffer in this example has 7 bytes, which is not much of a compression since the target buffer has only 11 bytes, but when this encoding is applied to large text files it can make a huge difference.
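To make the decoding concrete, here is a minimal C sketch that applies that instruction stream to the base text. This is not git's implementation (see patch-delta.c for the real thing): it assumes the simplified format described above and omits the delta's size headers and the wider offset/size flag bits.

```c
#include <stdio.h>
#include <string.h>

/* Minimal sketch of a git-style copy/insert decoder. Not git's actual
 * code (see patch-delta.c): the delta's size headers and the wider
 * offset/size flag bits (0x04, 0x08, 0x20, 0x40) are omitted. */
static size_t apply_delta(const unsigned char *delta, size_t delta_len,
                          const char *base, char *out)
{
    size_t i = 0, o = 0;
    while (i < delta_len) {
        unsigned char op = delta[i++];
        if (op & 0x80) {                   /* copy instruction */
            size_t off = 0, len = 0;
            if (op & 0x01) off  = delta[i++];       /* low offset byte */
            if (op & 0x02) off |= (size_t)delta[i++] << 8;
            if (op & 0x10) len  = delta[i++];       /* low size byte */
            memcpy(out + o, base + off, len);
            o += len;
        } else {                           /* insert: op = literal count */
            memcpy(out + o, delta + i, op);
            i += op;
            o += op;
        }
    }
    return o;
}

int main(void)
{
    const unsigned char delta[] = {
        0x91, 0x06, 0x05,  /* copy 5 bytes from offset 6: "cache" */
        0x01, 0x20,        /* insert 1 byte: " "                  */
        0x90, 0x05         /* copy 5 bytes from offset 0: "proxy" */
    };
    char out[32];
    size_t n = apply_delta(delta, sizeof delta, "proxy cache", out);
    printf("%.*s\n", (int)n, out);         /* prints: cache proxy */
    return 0;
}
```

Running it rebuilds the 11-byte target from the 7-byte delta.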
I have recently pushed a node.js library to GitHub which implements both diff/patch functions using git delta encoding. The code should be more readable and better commented than the one in the git source, which is heavily optimized.
I have also written a few tests that explain the output opcodes used in each example, in a format similar to the one above.
I think the diff algorithm used for pack files was originally linked to one of the delta encodings out there: initially (2005) xdelta, and then libXDiff.
But then, as detailed below, it shifted to a custom implementation.
Anyway, as mentioned here:
(Note: creating many packfiles, or retrieving information from a huge packfile, is costly, and explains why git doesn't handle huge files or huge repos well.
See more at "git with large files".)
This thread also reminds us:
LibXDiff is mentioned in this 2008 thread.
However, since then, the algorithm has evolved, probably into a custom one, as this 2011 thread illustrates, and as the header of
diff-delta.c
points out.
There is more on the packfiles in the Git Book.
Git 2.18 (Q2 2018) adds to the delta description in a new documentation section, which now states:
The pack format is part of a public API: the transfer protocols used for push and fetch operations use it to send less data over the network.
They are implemented in at least two other major Git implementations besides the reference: JGit and libgit2.
Therefore, it is very unlikely that there will be backwards-incompatible changes to the format, and it can be considered to be "standardized" in that sense.
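As an aside, a quick way to watch this deltified storage at work in your own repository is git verify-pack; a sketch (the actual pack file name under .git/objects/pack will differ):

```sh
$ git gc                                      # repack loose objects into a packfile
$ git verify-pack -v .git/objects/pack/pack-*.idx
# every entry shows: SHA-1, type, size, size-in-packfile, offset-in-packfile
# deltified entries also list their chain depth and the SHA-1 of their base object
```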
This amazing file from the docs describes the heuristics used in the pack algorithm as a funny commentary on an email by Linus: https://github.com/git/git/blob/v2.9.1/Documentation/technical/pack-heuristics.txt