How does SHA generate unique codes for big files i

Posted 2019-01-29 14:53

Question:

Using Git, I don't understand how SHA can generate just a 40-hexadecimal-digit code that can then be mapped to any file, which could be hundreds of lines long.

The way I'm thinking of it: let's say the string '1' -> 00...01, the string '2' -> 00..02, the string 'a34ed..fc' -> a34ed..fc, etc., so the hash map is returning the input itself. Then it's clear that all the hash codes get used up very quickly, and any string 41 characters long will be reusing one of the codes.

Also, I know that SHA doesn't guarantee that the code will always be unique, but I don't see how it even comes close to being useful.

Answer 1:

A SHA-1 hash is 160 bits long. That gives you 2^160, or exactly

1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976

possible hashes.

Assuming hash values are more or less unpredictable, the odds of two files accidentally having the same hash are infinitesimal to the point that it's just not worth worrying about it.
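To make this concrete, here is a small Python sketch (my own illustration, not something Git itself runs) showing that the digest is always 40 hex digits no matter how long the input is, and that short inputs like '1' and '2' do not map to 00...01 and 00...02. The helper name `sha1_hex` is just an assumption for the example.

```python
import hashlib

# Number of distinct 160-bit values, i.e. possible SHA-1 digests.
TOTAL_HASHES = 2 ** 160


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    print(f"{TOTAL_HASHES:,}")
    # Inputs of very different sizes all produce a 40-character digest.
    for sample in (b"", b"1", b"2", b"x" * 1_000_000):
        print(len(sample), sha1_hex(sample))
```

The digests for '1' and '2' look unrelated to each other and to the input, which is why the codes do not "get used up" in order.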

Quoting from Scott Chacon's book "Pro Git":

However, you should be aware of how ridiculously unlikely this scenario is. The SHA–1 digest is 20 bytes or 160 bits. The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about 2^80.

...

Here’s an example to give you an idea of what it would take to get a SHA–1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA–1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
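The arithmetic behind that example can be checked in a few lines. This is a rough sketch using only the figures from the quote (6.5 billion people, 1 million objects per second each, 5 years); the function name and the 365.25-day year are my own assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def objects_after(years: float,
                  people: float = 6.5e9,
                  objects_per_second: float = 1e6) -> float:
    """Total objects pushed if every person produces objects at the given rate."""
    return people * objects_per_second * SECONDS_PER_YEAR * years


total = objects_after(5)
birthday_point = 2.0 ** 80  # roughly where a collision becomes 50% likely

if __name__ == "__main__":
    print(f"objects after 5 years: {total:.3e}")
    print(f"2^80:                  {birthday_point:.3e}")
    print(f"ratio:                 {total / birthday_point:.2f}")
```

The two numbers land within the same order of magnitude, which is all the quoted example claims.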

It's true that there must be two 21-byte files that have the same SHA-1 hash (since there are 2^168 such files and only 2^160 possible SHA-1 hashes). No such files have ever been discovered.
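The pigeonhole argument is easier to see with a toy hash. In the sketch below I keep only the first byte of SHA-1 (an assumption made purely for the demo), so there are just 256 possible values and any 257 distinct inputs must contain a collision. The names `tiny_hash` and `find_collision` are hypothetical helpers.

```python
import hashlib


def tiny_hash(data: bytes) -> int:
    """First byte of SHA-1: a toy hash with only 256 possible values."""
    return hashlib.sha1(data).digest()[0]


def find_collision(limit: int = 257):
    """Hash the strings "0", "1", ... and return the first colliding pair."""
    seen = {}
    for i in range(limit):
        data = str(i).encode()
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    return None  # cannot happen once limit exceeds 256 distinct inputs


# Same counting argument for real SHA-1 and 21-byte files.
FILES_21_BYTES = 2 ** (8 * 21)
AVERAGE_INPUTS_PER_HASH = FILES_21_BYTES // 2 ** 160

if __name__ == "__main__":
    print(find_collision())
    print(AVERAGE_INPUTS_PER_HASH)
```

With the full 160 bits the same guarantee holds; finding the colliding pair is the hard part.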

UPDATE: As of February 2017, two distinct PDF files with identical SHA-1 checksums have been generated, using a technique that's more than 100,000 times as fast as a brute force attack. Details here: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Linus Torvalds (the author of Git) has posted a (preliminary) response here: http://marc.info/?l=git&m=148787047422954



Answer 2:

In fact, what I call your "margin of safety" determines how many objects you can store.

The widely quoted "about 2^80" number is the point at which you have approximately a 50% chance of a hash collision. To keep the chance below about 1 out of 10^18, the number of distinct objects in the repository should not exceed about 1.7 quadrillion (1.71 x 10^15).

(I did some math for a book I'm working on; I haven't had it checked by a real mathematician, but when I ran the same sort of numbers against other hash sizes, my outputs agreed with those on Wikipedia, for whatever that's worth. :-) )
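If you want to see the square-root behaviour without waiting for 2^80 objects, you can shrink the hash. The following sketch truncates SHA-1 to 16 bits (an assumption for the demo only) and counts how many distinct inputs it takes before two of them collide. The function names and the input scheme `"trial:count"` are made up for illustration.

```python
import hashlib
import math

BITS = 16  # toy hash width chosen for the demo; real SHA-1 uses 160


def truncated_sha1(data: bytes, bits: int = BITS) -> int:
    """Top `bits` bits of the SHA-1 digest, as an integer."""
    full = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return full >> (160 - bits)


def inputs_until_collision(trial: int, bits: int = BITS) -> int:
    """Hash distinct inputs until two share a truncated hash; return the count."""
    seen = set()
    count = 0
    while True:
        h = truncated_sha1(f"{trial}:{count}".encode(), bits)
        count += 1
        if h in seen:
            return count
        seen.add(h)


def average_inputs(trials: int = 200, bits: int = BITS) -> float:
    """Average number of inputs needed for the first collision."""
    return sum(inputs_until_collision(t, bits) for t in range(trials)) / trials


if __name__ == "__main__":
    print("measured average:", average_inputs())
    print("sqrt(2^bits):    ", math.sqrt(2 ** BITS))
```

The measured average should come out at a small multiple of sqrt(2^16) = 256, nowhere near the 65,536 possible values. That is the same effect that puts the 50% point for SHA-1 near 2^80 and not 2^160.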

Edit to add: here's the approximation formula. Let r be the cardinality of the hash function (so r is 2^160 for SHA-1) and U be the desired probability of uniqueness (so U is 0.5 for the usual "50% chance of safety, 50% chance of collision" statistic). The maximum number of hash inputs is:

(1 + sqrt(1 + 8r ln(1 / U))) / 2

The natural log of 1 / .5 is about 0.693, so 8r ln(1 / U) is about 5.5r and we have about sqrt(5.5r)/2, roughly 1.18 sqrt(r), which is of course just about sqrt(r). Hence for a k-bit hash, "50% probability of uniqueness" occurs after about 2^(k/2) hashes.
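The formula is easy to evaluate directly. Here is a minimal sketch in Python; `max_inputs` is a name I picked for the example, and plain floating point is accurate enough for the 50% case.

```python
import math


def max_inputs(r: float, u: float) -> float:
    """Approximate number of hash inputs that keeps the probability of
    uniqueness at u, for a hash with r possible values."""
    return (1 + math.sqrt(1 + 8 * r * math.log(1 / u))) / 2


if __name__ == "__main__":
    for k in (32, 64, 128, 160):
        n = max_inputs(2.0 ** k, 0.5)
        # The last column is the ratio to sqrt(r) = 2^(k/2).
        print(k, f"{n:.4e}", round(n / 2 ** (k / 2), 4))
```

For every hash size the ratio to sqrt(r) stays close to 1.18, which is why the 50% point is usually just quoted as "about sqrt(r)".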

To see (ballpark) how I get my number, in the neighborhood of 10^15 objects, let U = 1 - 10^-18. The natural log of 1 / U is basically the original 10^-18, which is about 2^-60, so we knock about 2^60 off the range r, leaving about 2^100. The square root of that is about 2^50, which is about 10^15.
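One practical catch if you try this on a computer: in double precision, 1 - 10^-18 rounds to exactly 1.0, so a naive log(1 / U) gives 0. The sketch below works from the collision probability p instead and uses `math.log1p`, which stays accurate for tiny p. The function name `max_objects` is an assumption for the example.

```python
import math


def max_objects(bits: int, collision_probability: float) -> float:
    """Approximate object count that keeps the chance of any collision
    at or below collision_probability for a hash of the given width."""
    r = 2.0 ** bits
    # ln(1 / U) with U = 1 - p, computed as -log1p(-p) to avoid the
    # rounding of 1 - p to 1.0 when p is very small.
    ln_inv_u = -math.log1p(-collision_probability)
    return (1 + math.sqrt(1 + 8 * r * ln_inv_u)) / 2


if __name__ == "__main__":
    print(1 - 1e-18 == 1.0)  # shows why the naive form fails
    print(f"{max_objects(160, 1e-18):.3e}")
    print(f"{2.0 ** 50:.3e}")  # the ballpark figure from the text
```

The result agrees with the roughly 1.7 quadrillion figure above and sits in the same neighborhood as the 2^50 ballpark.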



Answer 3:

The mistake is assuming that the SHA code is used to regenerate the contents of a file. It is not: Git stores the contents separately, and the SHA code is just the key used to look up a stored object such as a commit. Commits can't simply be numbered from 1 upward because with Git different people can work on different branches of the same project, making commits without knowing about each other. When these branches are merged, every commit still needs a unique key. A practical way to get keys that will not clash is to derive them with something like SHA, and as others have explained, the probability of two different objects getting the same key is almost zero.
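For what it's worth, here is how the key for a file's contents can be reproduced outside Git. This is a sketch assuming the default SHA-1 object format, where a blob's ID is the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the raw contents. The function name `git_blob_id` is my own.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob with these contents (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID in every SHA-1 Git repository.
    print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    print(git_blob_id(b"hundreds of lines\n" * 500))
```

You should get the same value from `git hash-object <file>` for a file with the same bytes. The key is computed from the content and only goes one way: Git looks the content up by key, it never rebuilds the content from the 40 digits.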



Tags: git sha