Recently a team of researchers generated two files with the same SHA-1 hash (https://shattered.it/).
Since Git uses this hash for its internal storage, how far does this kind of attack influence Git?
Linus' response might shed some light:
Source: https://marc.info/?l=git&m=148787047422954
Edit, late December 2017: Git version 2.16 is gradually acquiring internal interfaces to allow for different hashes. There is a long way to go yet.
The short (but unsatisfying) answer is that the example files are not a problem for Git—but two other (carefully calculated) files could be.
I downloaded both of these files, `shattered-1.pdf` and `shattered-2.pdf`, and put them into a new empty repository.
Even though the two files have the same SHA-1 checksum (and display mostly the same, although one has a red background and the other has a blue background), they get different Git hashes.
As stored in Git, the two files have these SHA-1 checksums: one is `ba9aa...` and the other is `b621e...`. Neither is `38762c...`, the checksum the two raw files share. But why?
The answer is that Git stores files not as themselves, but as the string literal `blob`, a space, the size of the file in decimal, an ASCII NUL byte, and then the file data. Both files are exactly the same size, so both are prefixed with the literal text `blob 422435\0` (where `\0` represents a single byte, a la C or Python octal escapes in strings).
Perhaps surprisingly (or not, if you know anything of how SHA-1 is calculated), adding the same prefix to two different files that nonetheless produced the same checksum before causes them to now produce different checksums.
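The blob framing described above can be sketched in a few lines of Python. This is only an illustration of the header-plus-data scheme, not Git's own code; the function names `sha1_hex` and `git_blob_id` and the sample content are made up for the example.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain SHA-1 of the raw bytes, with no Git header."""
    return hashlib.sha1(data).hexdigest()


def git_blob_id(data: bytes) -> str:
    """SHA-1 of 'blob <decimal size>\\0' followed by the data."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Stand-in content (an assumption for the demo), not one of the PDFs.
    sample = b"hello\n"
    print("raw SHA-1:  ", sha1_hex(sample))
    print("Git blob ID:", git_blob_id(sample))
```

For any file, `git_blob_id` should agree with what `git hash-object <file>` prints, while `sha1_hex` gives the checksum of the bare file contents.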
The reason this should become unsurprising is that if the final checksum result were not exquisitely sensitive to the position, as well as the value, of each input bit, it would be easy to produce collisions on demand by taking a known input file and merely re-arranging some of its bits. These two input files produce the same sum despite having a different byte at `char 193, line 8`, but this result was achieved, according to the researchers, by trying over 9 quintillion (short scale) inputs. To get that result, they put in carefully chosen blocks of raw data, at a position they controlled, that would affect the sums, until they found pairs of inputs that resulted in a collision.
By adding the `blob` header, Git moved the position, destroying the 110-GPU-years of computation in a single more or less accidental burp.
Now, knowing that Git will do this, they could repeat their 110-GPU-years of computation with inputs that begin with `blob 422435\0` (provided their sacrificial blocks don't get pushed around too much; and the actual number of GPU-years of computation needed would probably vary, as the process is a bit stochastic). They would then come up with two different files that could have the `blob` header stripped off. These two files would now have different SHA-1 checksums from each other, but when `git add`-ed, both would produce the same SHA-1 checksum.
In that particular case, the first file added would "win" the slot. (Let's assume it's named `shattered-3.pdf`.) A good-enough Git (I'm not at all sure that the current Git is this good; see Ruben's experiment-based answer to How would Git handle a SHA-1 collision on a blob?) would notice that `git add shattered-4.pdf`, attempting to add the second file, collided with the first-but-different `shattered-3.pdf`, and would warn you and fail the `git add` step. In any case you would be unable to add both of these files to a single repository.
But first, someone has to spend a lot more time and money to compute the new hash collision.
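To make the "first file wins the slot" scenario concrete, here is a toy content-addressed store. It is a sketch of the "good-enough" behavior described above, not Git's real object database: `ToyObjectStore` and `find_collision` are hypothetical names, and the object ID is deliberately truncated to one byte so that a collision can be found by brute force in a fraction of a second.

```python
import hashlib


class ToyObjectStore:
    """Toy content-addressed store; an illustration, NOT Git's implementation."""

    def __init__(self, id_bytes: int = 1):
        # Assumption for the demo: keep only the first byte(s) of the SHA-1
        # so that collisions are cheap to find.
        self.id_bytes = id_bytes
        self.objects = {}

    def object_id(self, data: bytes) -> str:
        header = b"blob " + str(len(data)).encode("ascii") + b"\0"
        return hashlib.sha1(header + data).digest()[: self.id_bytes].hex()

    def add(self, data: bytes) -> str:
        oid = self.object_id(data)
        existing = self.objects.get(oid)
        if existing is None:
            self.objects[oid] = data  # first content to arrive wins the slot
        elif existing != data:
            # Same ID, different content: refuse rather than silently drop it.
            raise ValueError(f"collision on object {oid}: refusing to add")
        return oid


def find_collision(store: ToyObjectStore, first: bytes) -> bytes:
    """Brute-force a different input with the same (truncated) object ID."""
    target = store.object_id(first)
    n = 0
    while True:
        candidate = b"candidate-%d" % n
        if candidate != first and store.object_id(candidate) == target:
            return candidate
        n += 1


if __name__ == "__main__":
    store = ToyObjectStore()
    first = b"stand-in for shattered-3.pdf"
    store.add(first)
    second = find_collision(store, first)
    try:
        store.add(second)
    except ValueError as err:
        print(err)
```

With a full 160-bit SHA-1 the `find_collision` loop is exactly the expensive part; the store logic itself stays the same.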