What would actually happen if I had a hash collision while using git?
E.g. I manage to commit two files with the same sha1 checksum, would git notice it or corrupt one of the files?
Could git be improved to live with that, or would I have to change to a new hash algorithm?
(Please do not deflect this question by discussing how unlikely that is - Thanks)
Google now claims that SHA-1 collision is possible under certain preconditions: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
Since git uses SHA-1 to check for file integrity, this means that file integrity in git is compromised.
IMO, git should definitely use a better hashing algorithm since deliberate collision is now possible.
Picking atoms on 10 Moons
An SHA-1 hash is a 40 hex character string... that's 4 bits per character times 40... 160 bits. Now we know 10 bits is approximately 1000 (1024 to be exact) meaning that there are 1 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 different SHA-1 hashes... 1048.
What is this equivalent of? Well the Moon is made up of about 1047 atoms. So if we have 10 Moons... and you randomly pick one atom on one of these moons... and then go ahead and pick a random atom on them again... then the likelihood that you'll pick the same atom twice, is the likelihood that two given git commits will have the same SHA-1 hash.
Expanding on this we can ask the question...
How many commits do you need in a repository before you should start worrying about collisions?
This relates to so called "Birthday attacks", which in turn refers to the "Birthday Paradox" or "Birthday Problem", which states that when you pick randomly from a given set, you need surprisingly few picks before you are more likely than not to have picked something twice. But "surprisingly few" is a very relative term here.
Wikipedia has a table on the probability of Birthday Paradox collisions. There is no entry for a 40 character hash. But an interpolation of the entries for 32 and 48 characters lands us in the range of 5*1022 git commits for a 0.1% probability of a collision. That is fifty thousand billion billion different commits, or fifty Zettacommits, before you have reached even a 0.1% chance that you have a collision.
The byte sum of the hashes alone for these commits would be more data than all the data generated on Earth for a year, which is to say you would need to churn out code faster than YouTube streams out video. Good luck with that. :D
The point of this is that unless someone is deliberately causing a collision, the probability of one happening at random is so staggeringly small you can ignore this issue
"But when a collision does occur, then what actually happens?"
Ok, suppose the improbable does happen, or suppose someone managed to tailor a deliberate SHA-1 hash collision. What happens then?
In that case there is an excellent answer where someone experimented on it. I will quote from that answer:
As you can seem some cases are not good. Especially cases #2 and #3 messes up your repository. However, it does seem that the fault stays within that repository, and the attack/bizarre improbability does not propagate to other reposistories.
Also it seems that the issue of deliberate collisions is being recognised as a real threat, and so for instance GitHub is taking measures to prevent it.
Collisions are possible for any hash algorithm, so changing the hash function doesn't preclude the problem, it just makes it less likely to happen. So you should choose then a really good hash function (SHA-1 already is, but you asked not to be told :)
I recently found a posting from 2013-04-29 in a BSD discussion group at
http://openbsd-archive.7691.n7.nabble.com/Why-does-OpenBSD-use-CVS-td226952.html
where the poster claims:
Unfortunately, he provides no proof for his claim. But maybe you would like trying to contact him and ask him about this supposed incident.
But on a more general level, due to the birthday attack a chance for an SHA-1 hash collision is 1 in pow(2, 80).
This sounds a lot and is certainly way more than the total number of versions of individual files present in all Git repositories of the world combined.
However, this only applies to the versions which actually remain in version history.
If a developer relies very much on rebasing, every time a rebase is run for a branch, all the commits in all the versions of that branch (or rebased part of the branch) get new hashes. The same is true for every file modifies with "git filter-branch". Therefore, "rebase" and "filter-branch" might be big multipliers for the number of hashes generated over time, even though not all of them are actually kept: Frequently, after rebasing (especially for the purpose of "cleaning up" a branch), the original branch is thrown away.
But if the collision occurs during the rebase or filter-branch, it can still have adverse effects.
Another thing would be to estimate the total number of hashed entities in git repositories and see how far they are from pow(2, 80).
Let's say we have about 8 billion people, and all of them would be running git and keep their stuff versioned in 100 git repositories per person. Let' further assume the average repository has 100 commits and 10 files, and only one of those files changes per commit.
For every revision we have at least a hash for the tree object and the commit object itself. Together with the changed file we have 3 hashes per revision, and thus 300 hashes per repository.
For 100 repositories of 8 billion people this gives pow(2, 47) which is still far from pow(2, 80).
However, this does not include the supposed multiplication effect mentioned above, because I am uncertain how to include it in this estimation. Maybe it could increase the chances for a collision considerably. Especially if very large repositories which a long commit history (like the Linux Kernel) are rebased by many people for small changes, which nevertheless create different hashes for all affected commits.
Well I guess we now know what would happen - you should expect that your repository would become corrupted (source).
You can see a good study in "How would Git handle a SHA-1 collision on a blob?".
Since a SHA1 collision is now possible (as I reference in this answer with shattered.io), know that Git 2.13 (Q2 2017) will improve/mitigate the current situation with a "detect attempt to create collisions" variant of SHA-1 implementation by Marc Stevens (CWI) and Dan Shumow (Microsoft).
See commit f5f5e7f, commit 8325e43, commit c0c2006, commit 45a574e, commit 28dc98e (16 Mar 2017) by Jeff King (
peff
).(Merged by Junio C Hamano --
gitster
-- in commit 48b3693, 24 Mar 2017)Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".
You will be able to use another hash algorithm: SHA1 is no longer the only one for Git.
Git 2.18 (Q2 2018) documents that process.
See commit 5988eb6, commit 45fa195 (26 Mar 2018) by Ævar Arnfjörð Bjarmason (
avar
).(Merged by Junio C Hamano --
gitster
-- in commit d877975, 11 Apr 2018)So the new documentation now reads:
Note: that same document now (Q3 2018, Git 2.19) explicitly references the "new hash" as SHA-256: see "Why doesn't Git use more modern SHA?".