I find it hard to wrap my head around how Git creates fully unique hashes that aren't allowed to be the same even in the first 4 characters. I'm able to call commits in Git Bash using only the first four characters. Is it specifically decided in the algorithm that the first characters are "ultra"-unique and will not ever conflict with other similar hashes, or does the algorithm generate every part of the hash in the same way?
Answer 1:
Git uses the following information to generate the SHA-1 of a commit:
- The source tree of the commit (which unravels to all the subtrees and blobs)
- The parent commit sha1
- The author info
- The committer info (right, those are different!)
- The commit message
(For the complete explanation, look here.)
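To make the list above concrete, here is a minimal Python sketch of how such a hash can be computed. It assumes the standard loose-object layout (a "<type> <size>" header, a NUL byte, then the content); the helper names git_object_id and commit_body, the author line, and the timestamps are made up for illustration and are not Git's own code.

```python
import hashlib


def git_object_id(obj_type, body):
    """Return the SHA-1 hex id for a Git object of the given type and raw content."""
    # Git hashes a header "<type> <size>" plus a NUL byte, followed by the content.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message):
    """Build the text of a commit object from the fields listed above (hypothetical helper)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


# Illustrative values only: an empty tree and an invented identity/timestamp.
EMPTY_TREE = git_object_id("tree", b"")
who = "A U Thor <author@example.com> 1500000000 +0000"

first = git_object_id("commit", commit_body(EMPTY_TREE, [], who, who, "first\n"))
second = git_object_id("commit", commit_body(EMPTY_TREE, [first], who, who, "second\n"))
print(first)
print(second)
```

Because the parent's id is part of the hashed content, changing any earlier commit changes every id after it. Every hex digit of the result comes out of the same SHA-1 computation; none of them is treated specially.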
Git does NOT guarantee that the first 4 characters will be unique. In chapter 7 of the Pro Git Book it is written:
Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous:
So Git just makes the abbreviation as long as necessary to remain unique. They even note that:
Generally, eight to ten characters are more than enough to be unique within a project.
As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.
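So in practice Git simply relies on how improbable it is for two objects to have exactly the same SHA-1 (or the same first X characters of one). Every character of the hash is generated the same way; the leading ones are not special.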
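To get a feel for that improbability, here is a rough sketch using the birthday approximation. It assumes object ids behave like uniformly random hex strings; the function name collision_probability and the sample object counts are only illustrative, not something Git computes.

```python
import math


def collision_probability(num_objects, hex_digits):
    """Approximate chance that at least two of num_objects share the same hex prefix."""
    space = 16 ** hex_digits  # number of distinct prefixes of that length
    # Birthday approximation: p = 1 - exp(-n * (n - 1) / (2 * N))
    return -math.expm1(-num_objects * (num_objects - 1) / (2 * space))


# Sample repository sizes, chosen only for illustration.
for n in (100, 10_000, 3_600_000):
    for digits in (4, 7, 12, 40):
        p = collision_probability(n, digits)
        print(f"{n:>9} objects, {digits:>2} hex digits: p = {p:.3g}")
```

The numbers grow quickly for short prefixes and stay negligible for the full 40 digits, which is why an abbreviation that works today in a small repository may become ambiguous later.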
Answer 2:
Apr. 2017: Beware that after the whole shattered.io episode (where a SHA-1 collision was achieved by Google), the 20-byte format won't be there forever.
A first step toward that is to replace unsigned char sha1[20], which is hard-coded all over the Git codebase, with a generic object whose definition might change in the future (SHA2?, Blake2, ...).
See commit e86ab2c (21 Feb 2017) by brian m. carlson (bk2204).
Convert the remaining uses of unsigned char [20] to struct object_id.
That is an example of an ongoing effort started with commit 5f7817c (13 Mar 2015) by brian m. carlson (bk2204), for v2.5.0-rc0, in cache.h:
/* The length in bytes and in hex digits of an object name (SHA-1 value). */
#define GIT_SHA1_RAWSZ 20
#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)
struct object_id {
	unsigned char hash[GIT_SHA1_RAWSZ];
};
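The relation between the two constants (raw bytes versus hex digits) can be checked with a few lines of Python. This is only a sketch: object_name_sizes is a made-up helper, and sha256 is used purely as an example of a longer hash, not as a statement of what Git will adopt.

```python
import hashlib


def object_name_sizes(algorithm):
    """Return (raw size in bytes, size in hex digits) for a hashlib algorithm name."""
    raw = hashlib.new(algorithm).digest_size
    # Each raw byte is written as two hex digits, as in GIT_SHA1_HEXSZ above.
    return raw, 2 * raw


for name in ("sha1", "sha256"):
    raw, hexsz = object_name_sizes(name)
    print(f"{name}: {raw} raw bytes, {hexsz} hex digits")
```

Hiding the array behind struct object_id means a size like this is defined in one place instead of being repeated as a literal 20 throughout the code.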
And don't forget that, even with SHA-1, the first 4 characters are no longer enough to guarantee uniqueness, as I explain in "How much of a git sha is generally considered necessary to uniquely identify a change in a given codebase?".
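As a rough illustration of why 4 characters stop being enough as a repository grows, the sketch below finds the shortest prefix length that keeps a set of ids unambiguous. It is not Git's implementation; min_unique_abbrev, its minimum parameter, and the generated ids are assumptions made for the example.

```python
import hashlib


def min_unique_abbrev(object_ids, minimum=4):
    """Smallest prefix length at which every distinct id has a distinct prefix."""
    ids = sorted(set(object_ids))
    needed = minimum
    # After sorting, the longest shared prefix always occurs between neighbours.
    for a, b in zip(ids, ids[1:]):
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
        needed = max(needed, common + 1)
    return needed


# Fake object ids: SHA-1 of small numbers, standing in for a real repository.
fake_ids = [hashlib.sha1(str(i).encode("ascii")).hexdigest() for i in range(2000)]
print(min_unique_abbrev(fake_ids[:10]))
print(min_unique_abbrev(fake_ids))
```

With only a handful of ids the minimum length is usually sufficient, while a couple of thousand ids already require longer prefixes.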
Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".
You will be able to use another hash: SHA1 is no longer the only one for Git.