Why does Git use the SHA1 of the *compressed* obje

Posted 2019-06-15 12:42

Question:

I'm just curious as to why this choice was made. It basically rules out changing the compression algorithm used by Git, because the object name is not the SHA1 of the raw blob. Perhaps there is some efficiency consideration here. Maybe zlib is faster at compressing a file than SHA1 is at hashing it, so compressing before hashing is faster?

Here is a link to the original Git README by Linus: http://git.kernel.org/?p=git/git.git;a=blob;f=README;h=27577f76849c09d3405397244eb3d8ae1d11b0f3;hb=e83c5163316f89bfbde7d9ab23ca2e25604af290

And here is the relevant paragraph:

"There are several kinds of objects in the content-addressable collection database. They are all in deflated with zlib, and start off with a tag of their type, and size information about the data. The SHA1 hash is always the hash of the compressed object, not the original one."

Answer 1:

As you said, that is the original README, from when Git was first started. Since then, Git has been changed so that the SHA1 is computed before compressing.

The Git user manual states:

"It’s worth noting that the SHA-1 hash that is used to name the object is the hash of the original data plus this header, so 'sha1sum' file does not match the object name for file. (Historical note: in the dawn of the age of git the hash was the SHA-1 of the compressed object.)"

http://schacon.github.com/git/user-manual.html#object-details
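
To make the "original data plus this header" point concrete, here is a minimal Python sketch of how a blob's name can be reproduced. It assumes the usual loose-object header layout for blobs (`blob <size>\0`); the function names `git_blob_id` and `loose_object_bytes` are made up for illustration and are not part of Git.

```python
import hashlib
import zlib


def git_blob_id(data: bytes) -> str:
    """Name of a blob: SHA-1 over the header plus the uncompressed content."""
    # Assumed header layout: object type, a space, decimal size, a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_bytes(data: bytes, level: int = -1) -> bytes:
    """Bytes as they might be stored on disk: zlib-deflated header + content.

    The compression level only affects these stored bytes, not the name.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return zlib.compress(header + data, level)


if __name__ == "__main__":
    content = b"hello world\n"

    # Hash of header + content; should agree with `git hash-object`.
    print("object name:", git_blob_id(content))

    # Hash of the bare content, which is what `sha1sum file` would print.
    print("plain sha1: ", hashlib.sha1(content).hexdigest())

    # Different compression levels decompress to the same payload,
    # so the object name is unaffected by how the object is compressed.
    fast = loose_object_bytes(content, 1)
    best = loose_object_bytes(content, 9)
    print("same payload:", zlib.decompress(fast) == zlib.decompress(best))
```

If Git is installed, the first printed value can be compared against `echo "hello world" | git hash-object --stdin`. Because the hash is taken before compression, the stored bytes could in principle be compressed differently without changing any object name, which is exactly the concern raised in the question.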