How are the Git commit IDs generated to uniquely identify the commits?
Example: 521747298a3790fde1710f3aa2d03b55020575aa
How does it work? Are they only unique for each project? Or for the Git repositories globally?
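A Git commit ID is a SHA-1 hash of every important thing about the commit. I'm not going to list them all, but here are the important ones: the content (the whole tree, not just the diff), the IDs of the previous commit(s), the author and committer names, emails, and dates, and the log message.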
Change any of that and the commit ID changes. And yes, the same commit with the same properties will have the same ID on a different machine. This serves three purposes. First, it means the system can tell if a commit has been tampered with. It's baked right into the architecture.
Second, one can rapidly compare commits just by looking at their IDs. This makes Git's network protocols very efficient. Want to compare two commits to see if they're the same? Don't have to send the whole diff, just send the IDs.
Third, and this is the genius, two commits with the same ID have the same history. That's why the IDs of the previous commits are part of the hash. If the content of a commit is the same but the parents are different, the commit ID must be different. That means when comparing repositories (like in a push or pull), once Git finds a commit in common between the two repositories it can stop checking. This makes pushing and pulling extremely efficient. For example...
origin
A - B - C - D - E [master]

local
A - B [origin/master]
The network conversation for git fetch origin goes something like this...
local: Hey origin, what branches do you have?
origin: I have master at E.
local: I don't have E, I have your master at B.
origin: B you say? I have B and it's an ancestor of E. That checks out. Let me send you C, D and E.

This is also why, when you rewrite a commit with rebase, everything after it has to change. Here's an example.
A - B - C - D - E - F - G [master]
Let's say you rewrite D, just to change the log message a bit. Now D can no longer be D, it has to be copied to a new commit we'll call D1.
A - B - C - D - E - F - G [master]
         \
           D1
While D1 can have C as its parent (C is unaffected, commits do not know their children) it is disconnected from E, F and G. If we change E's parent to D1, E can't be E anymore. It has to be copied to a new commit E1.
A - B - C - D - E - F - G [master]
         \
           D1 - E1
And so on with F to F1 and G to G1.
A - B - C - D - E - F - G
         \
           D1 - E1 - F1 - G1 [master]
They all have the same code, just different parents (or in D1's case, a different commit message).
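To make the ripple effect concrete, here is a rough Python sketch of a toy commit chain. It is a simplification and not how Git stores things internally: the `commit_id` and `build_chain` helpers are made up for illustration, the tree values are placeholder strings rather than real tree hashes, and the author and committer lines are left out. The only point is that each ID is computed from the parent's ID, so rewording D forces new IDs for D, E, F and G while A, B and C keep theirs.

```python
import hashlib


def commit_id(tree, parents, message):
    """Hash a simplified commit: tree, parent IDs, and log message.

    Assumption: real commits also carry author and committer lines,
    which are omitted here to keep the sketch short.
    """
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit %d\0" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def build_chain(commits):
    """Return the IDs of a linear history given (tree, message) pairs."""
    ids = []
    parent = None
    for tree, message in commits:
        parent = commit_id(tree, [parent] if parent else [], message)
        ids.append(parent)
    return ids


# Placeholder history A..G; the tree names are hypothetical labels.
history = [("tree-" + name, "commit " + name) for name in "ABCDEFG"]
original = build_chain(history)

# Reword only D's log message; every commit's content stays the same.
rewritten_history = list(history)
rewritten_history[3] = ("tree-D", "commit D, reworded")
rewritten = build_chain(rewritten_history)

for name, old, new in zip("ABCDEFG", original, rewritten):
    status = "unchanged" if old == new else "new ID"
    print(name, old[:7], new[:7], status)
```

Running it should report A, B and C as unchanged and D through G as having new IDs, which matches the D1 - E1 - F1 - G1 picture above.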
You can see exactly what goes into making a commit id by running
git cat-file commit HEAD
It will give you something like
tree 07e239f2f3d8adc12566eaf66e0ad670f36202b5
parent 543a4849f7201da7bed297b279b7b1e9a086a255
author Justin Howard <justin.howard@example.com> 1426631449 -0700
committer Justin Howard <justin.howard@example.com> 1426631471 -0700

My commit message
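It gives you the tree checksum, the parent commit ID, the author and committer (each with a timestamp and time zone), and the commit message.
Git takes all this and does a SHA-1 hash of it. You can reproduce the commit ID by running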
(printf "commit %s\0" $(git cat-file commit HEAD | wc -c); git cat-file commit HEAD) | sha1sum
This starts out by printing the string commit followed by a space, the byte count of the cat-file text, and a null byte. It then appends the cat-file text itself. All of that then gets run through sha1sum.
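The same computation can be sketched in Python, in case the shell one-liner is hard to read. The function names `git_object_id` and `recompute_head_id` are my own, not part of Git, and this assumes a repository using the default SHA-1 object format. The header has the form type, space, byte count, null byte, and the same scheme is used for other object types, which is why the empty blob makes a handy sanity check.

```python
import hashlib
import subprocess


def git_object_id(obj_type, body):
    """SHA-1 of '<type> <size>\\0' followed by the raw object body."""
    header = obj_type + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def recompute_head_id():
    """Recompute HEAD's ID from `git cat-file commit HEAD`.

    Assumption: run inside a Git repository with at least one commit.
    """
    body = subprocess.run(
        ["git", "cat-file", "commit", "HEAD"],
        check=True,
        capture_output=True,
    ).stdout
    return git_object_id(b"commit", body)


# Sanity check against the well-known ID of the empty blob.
print(git_object_id(b"blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Inside a repository, `recompute_head_id()` should return the same string as `git rev-parse HEAD`.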
As you can see, there is nothing that identifies the project or repository in this information, so commit IDs are not scoped to a project. The reason this doesn't cause problems is that it is astronomically unlikely for two different commits, in any repository, to produce the same hash.
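To put a number on "astronomically unlikely", here is a back-of-the-envelope sketch using the birthday bound. It assumes SHA-1 behaves like a uniformly random 160-bit function and ignores deliberately constructed collisions; the object count of 10^12 is a hypothetical figure chosen only for illustration, and `collision_probability_bound` is a made-up helper name.

```python
from fractions import Fraction

HASH_BITS = 160  # size of a SHA-1 digest in bits


def collision_probability_bound(n_objects):
    """Upper bound on P(any two of n objects share a hash).

    Union bound over all pairs: n(n-1)/2 pairs, each colliding
    with probability 1 / 2^HASH_BITS under the random-hash assumption.
    """
    pairs = n_objects * (n_objects - 1) // 2
    return Fraction(pairs, 2 ** HASH_BITS)


n = 10 ** 12  # hypothetical: a trillion distinct objects
bound = collision_probability_bound(n)
print("P(collision) <= %.2e" % float(bound))
```

Even with a trillion objects, the bound comes out far below one in 10^24.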