Hash collision in git

2020-01-25 04:34发布

What would actually happen if I had a hash collision while using git?

E.g. I manage to commit two files with the same sha1 checksum, would git notice it or corrupt one of the files?

Could git be improved to live with that, or would I have to change to a new hash algorithm?

(Please do not deflect this question by discussing how unlikely that is - Thanks)

9条回答
在下西门庆
2楼-- · 2020-01-25 05:05

A hash collision is so highly unlikely, that it is sheer mind blowing! Scientists all over the world are trying hard to achieve one, but didn't manage it yet. For certain algorithms such as MD5 they successed, though.

What are the odds?

SHA-256 has 2^256 possible hashes. That is about 10^78. Or to be more graphic, the chances of a collision are at about

1 : 100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000

The chance of winning the lottery is about 1 : 14 Mio. The chance of a collision with SHA-256 is like winning the lottery on 11 consecutive days!

Mathematic explanation: 14 000 000 ^ 11 ~ 2^256

Furthermore, the universe has about 10^80 atoms. That's just 100 times more than there are SHA-256 combinations.

Successful MD5 collision

Even for MD5 the chances are tiny. Though, mathematicians managed to create a collision:

d131dd02c5e6eec4 693d9a0698aff95c 2fcab58712467eab 4004583eb8fb7f89
55ad340609f4b302 83e488832571415a 085125e8f7cdc99f d91dbdf280373c5b
d8823e3156348f5b ae6dacd436c919c6 dd53e2b487da03fd 02396306d248cda0
e99f33420f577ee8 ce54b67080a80d1e c69821bcb6a88393 96f9652b6ff72a70

has the same MD5 as

d131dd02c5e6eec4 693d9a0698aff95c 2fcab50712467eab 4004583eb8fb7f89
55ad340609f4b302 83e4888325f1415a 085125e8f7cdc99f d91dbd7280373c5b
d8823e3156348f5b ae6dacd436c919c6 dd53e23487da03fd 02396306d248cda0
e99f33420f577ee8 ce54b67080280d1e c69821bcb6a88393 96f965ab6ff72a70

This doesn't mean that MD5 is less safe now that its algorithm is cracked. You can create MD5 collisions on purpose, but the chance of an accidental MD5 collision is still 2^128, which is still a lot.

Conclusion

You don't have to have a single worry about collisions. Hashing algorithms are the second safest way to check file sameness. The only safer way is a binary comparison.

查看更多
你好瞎i
3楼-- · 2020-01-25 05:08

It's not really possible to answer this question with the right "but" without also explaining why it's not a problem. It's not possible to do that without really having a good grip on what a hash really is. It's more complicated than the simple cases you might have been exposed to in a CS program.

There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some amount (ie. a hash) there will be a chance of collision directly related to the length of the data. The shorter the data, the LESS likely it will be. Now, the vast majority of the collisions will be gibberish, making them that much more likely to actually happen (you would never check in gibberish...even a binary image is somewhat structured). In the end, the chances are remote. To answer your question, yes, git will treat them as the same, changing the hash algorithm won't help, it'll take a "second check" of some sort, but ultimately, you would need as much "additional check" data as the length of the data to be 100% sure...keep in mind you would be 99.99999....to a really long number of digits.... sure with a simple check like you describe. SHA-x are cryptographically strong hashes, which means is't generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other, and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set of collisions - all but a few will be gibberish, and of the ones that aren't there's still a huge number to sift through if the message length is any significant length. The downside of a crypto hash is that they are slow to compute...in general.

So, what's it all mean then for Git? Not much. The hashes get done so rarely (relative to everything else) that their computational penalty is low overall to operations. The chances of hitting a pair of collisions is so low, it's not a realistic chance to occur and not be detected immediately (ie. your code would most likely suddenly stop building), allowing the user to fix the problem (back up a revision, and make the change again, and you'll almost certainly get a different hash because of the time change, which also feeds the hash in git). There is more likely for it to be a real problem for you if you're storing arbitrary binaries in git, which isn't really what it's primary use model is. If you want to do that...you're probably better off using a traditional database.

It's not wrong to think about this - it's a good question that a lot of people just pass off as "so unlikely it's not worth thinking about" - but it's really a little more complicated than that. If it DOES happen, it should be very readily detectible, it won't be a silent corruption in a normal workflow.

查看更多
够拽才男人
4楼-- · 2020-01-25 05:11

If two files have the same hash sum in git, it would treat those files as identical. In the absolutely unlikely case this happens, you could always go back one commit, and change something in the file so they wouldn't collide anymore ...

See Linus Torvalds' post in the thread “Starting to think about sha-256?” in the git mailing list.

查看更多
登录 后发表回答