I am confused as to how SHA-1 hashes are calculated for commits, trees, and blobs. As per this article, commit hashes are calculated based on the following factors:
- The source tree of the commit (which unravels to all the subtrees and blobs)
- The parent commit sha1
- The author info
- The committer info (right, those are different!)
- The commit message
Are the same factors involved for tree and blob hashes as well?
Git is sometimes called a "content-addressable filesystem". The hashes are the addresses, and they are based on the contents of the various objects. So, in order to know what the hash is based on, we only need to know the contents of the various objects.
Blobs
A blob is simply a stream of octets. Nothing more. It is akin to the concept of file content in a Unix filesystem.
So, the hash of a blob is based solely on its contents; a blob has no metadata (not even a file name).
Trees
A tree associates names and permissions with other objects (blobs or trees). A tree is simply a list of quadruples (permission, type, hash, name). For example, a tree may look like this:
Note the third entry which is itself a tree.
A tree is analogous to a directory special file in a Unix filesystem.
Again, the hash is based on the contents of the tree, which means on the names, permissions, types, and hashes of its leaves.
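To make the tree case concrete, here is a minimal Python sketch of how a tree hash could be reproduced by hand. It assumes the on-disk layout as I understand it: each entry is the mode in octal ASCII, a space, the name, a NUL octet, and the 20 raw bytes of the hash, with the type implied by the mode rather than stored separately. The entry names (README, run.sh, src) are made up for illustration; the two constants are the well-known hashes of the empty blob and the empty tree.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # hash of a zero-length blob
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # hash of a tree with no entries


def tree_body(entries):
    """Serialize (mode, name, hex_hash) triples into the body of a tree object."""

    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode("utf-8")
        # Entries are ordered by name; sub-trees compare as if the name ended in "/".
        return raw + b"/" if mode == "40000" else raw

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_hash)  # the hash is stored as 20 raw bytes, not as hex
    return body


def tree_hash(entries):
    """Hash the header ("tree <length>" plus NUL) followed by the serialized entries."""
    body = tree_body(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical directory: two empty files and one empty sub-directory.
entries = [
    ("100644", "README", EMPTY_BLOB),
    ("100755", "run.sh", EMPTY_BLOB),
    ("40000", "src", EMPTY_TREE),  # this entry is itself a tree
]
print(tree_hash(entries))
print(tree_hash([]) == EMPTY_TREE)  # True
```

Renaming an entry, changing its mode, or changing the hash it points to changes the serialized body and therefore the hash of the tree.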
Commits
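A commit records a snapshot of a tree in time together with some metadata and how the snapshot came to be. A commit consists of:
- the hash of the tree that forms the snapshot
- the hashes of the parent commits (zero or more)
- the author info, including a timestamp
- the committer info, including a timestamp
- the commit message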
The hash of a commit is based on those.
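As a rough illustration, the following Python sketch builds the textual body of a simple commit and hashes it. It assumes the plain layout (a `tree` line, zero or more `parent` lines, `author`, `committer`, a blank line, then the message) and ignores optional extras such as signatures or an encoding header. The identity, timestamp, and message are hypothetical.

```python
import hashlib


def commit_body(tree, parents, author, committer, message):
    """Build the textual body of a simple commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has no parent lines
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the header lines from the message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def commit_hash(tree, parents, author, committer, message):
    """Hash the header ("commit <length>" plus NUL) followed by the body."""
    body = commit_body(tree, parents, author, committer, message)
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values; the tree hash is the well-known hash of the empty tree.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "A U Thor <author@example.com> 1700000000 +0000"

first = commit_hash(tree, [], who, who, "Initial commit\n")
second = commit_hash(tree, [first], who, who, "Initial commit\n")
print(first)
print(second)
print(first != second)  # True: same tree and message, but a different parent
```

Because the timestamps are part of the author and committer lines, committing the same tree with the same message a second later already yields a different hash.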
Tags
Tags aren't objects in the sense above. They are not part of the object store and don't have a hash. They are references to objects. (Note: any object can be tagged, not just commits, although that is the normal use case.)
Annotated Tags
An annotated tag is different: it is part of the object store.
An annotated tag stores:
As with all other objects, the hash is calculated based on all of them and nothing more.
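For illustration, here is a small Python sketch of an annotated tag object, assuming the usual textual layout: `object`, `type`, `tag`, and `tagger` lines, a blank line, then the message. The tag name, tagger identity, and target hash are hypothetical placeholders.

```python
import hashlib


def tag_body(target, target_type, name, tagger, message):
    """Build the textual body of an annotated tag object."""
    return (
        f"object {target}\n"      # hash of the tagged object
        f"type {target_type}\n"   # its type: commit, tree, blob, or tag
        f"tag {name}\n"
        f"tagger {tagger}\n"
        f"\n{message}"
    ).encode("utf-8")


def tag_hash(target, target_type, name, tagger, message):
    """Hash the header ("tag <length>" plus NUL) followed by the body."""
    body = tag_body(target, target_type, name, tagger, message)
    header = b"tag " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values: a placeholder target hash and a made-up tagger.
target = "0123456789abcdef0123456789abcdef01234567"
tagger = "A U Thor <author@example.com> 1700000000 +0000"

print(tag_hash(target, "commit", "v1.0", tagger, "First release\n"))
```

The tag object has its own hash, distinct from the hash of the object it points to; a reference with the tag's name then points to the tag object.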
Signed tags
A signed tag is like an annotated tag, but adds a cryptographic signature.
Notes
Notes allow you to attach arbitrary content to an arbitrary Git object (usually a commit) without changing that object.
The storage of notes is a little more complicated. The content of a note is just a blob. Git keeps a special ref for notes (refs/notes/commits by default) that points to an ordinary commit; the tree of that commit contains one entry per "annotee object", named after the hash of that object and pointing to the blob with the contents of the note.
However, since notes are built from ordinary blobs, trees, and commits, and the association happens externally in that notes tree, their hashes are calculated just like those of any other blob, tree, or commit. Adding a note does not change the hash of the annotated object.
Storage Format
The storage format contains a simple header. The content that is actually stored (and hashed) is the header followed by a NUL octet followed by the object contents.
The header contains the type and the length of the object contents, encoded in ASCII. So, the blob which contains the string
Hello, World
encoded in ASCII would look like this:
blob 12\0Hello, World
And that is what is hashed and stored.
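A minimal Python sketch of this computation, using only the standard library, might look as follows. The helper name `git_object_hash` is made up for this example.

```python
import hashlib


def git_object_hash(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 hex digest of a Git object with the given type and contents."""
    # Header: the type, a space, the content length in decimal ASCII, then a NUL octet.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# "Hello, World" is 12 octets long, so the hashed stream is "blob 12", NUL, "Hello, World".
stored = b"blob 12\0Hello, World"
print(git_object_hash("blob", b"Hello, World") == hashlib.sha1(stored).hexdigest())  # True

# The empty blob has a well-known hash.
print(git_object_hash("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Note that the file name appears nowhere in this computation: two files with identical contents map to the same blob.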
Other types of objects have a more structured format, so a tree object would start off with a header
tree <length of content in octets>\0
followed by a strictly defined, structured, serialized representation of a tree. The same goes for commits, and so on.
Most formats are textual formats, based on simple ASCII. For example, the size is not encoded as a binary integer, but as a decimal integer with each digit represented as the corresponding ASCII character.
Compression
After the hash is computed, the octet stream corresponding to the object including the header is compressed using zlib-deflate, and the resulting octet stream is stored in a file based on the hash; per default in the directory
Packs
The above storage format is called the loose object format, because every object is stored individually. There is a more efficient storage format (which is also used as the network transmission format), called a packfile.
Packfiles are an important speed and storage optimization, but they are rather complex, so I am not going to describe them in detail.
As a first approximation, a packfile consists of all the uncompressed objects concatenated into a single file and a second file, which contains an index of where in the packfile which object resides. The packfile as a whole is then compressed, which allows a better compression ratio, since the algorithm can also find redundancies between objects and not just within a single object. (E.g. if you have two revisions of a blob which are almost identical … which is kind of the norm in an SCM.)
In reality, the packfile is not simply run through zlib-deflate as a whole; rather, Git uses a binary delta compression algorithm that stores an object as a delta against a similar object (the individual entries are then still zlib-compressed). It also uses certain heuristics for how to place the objects in the packfile so that objects which are likely to have large similarity are placed closely together. (The delta algorithm cannot actually see the whole packfile at once, as that would consume too much memory; rather, it operates on a sliding window over the packfile, and the heuristics try to ensure that similar objects land within the same window.) Some of those heuristics are: look at the names a tree associates with blobs and try to keep the ones with the same names close together, try to keep the ones with the same file extension close together, try to keep subsequent revisions close together, and so on.
Poking around
Loose (i.e. not packed) objects are just zlib-deflated. Un-deflate them and look at them to see how they are structured. Note that the uncompressed octet stream is exactly what is being hashed; the objects are stored compressed, but they are hashed before they are compressed.
Here's a simple Perl one-liner to un-deflate (is that inflate?) a stream:
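In Python, the same un-deflate step can be sketched with the standard `zlib` module. The function name `parse_loose_object` is hypothetical, and the example round-trips an in-memory object instead of reading a file from a real repository.

```python
import hashlib
import zlib


def parse_loose_object(compressed: bytes):
    """Inflate a loose object and split it into (type, content, hash)."""
    raw = zlib.decompress(compressed)
    # The header ends at the first NUL octet; everything after it is the content.
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("length in header does not match the content")
    # The hash is taken over the uncompressed stream, header included.
    return obj_type, content, hashlib.sha1(raw).hexdigest()


# Simulate what is stored for a blob, then read it back.
raw = b"blob 12\0Hello, World"
stored = zlib.compress(raw)
obj_type, content, digest = parse_loose_object(stored)
print(obj_type, content)  # blob b'Hello, World'
print(digest == hashlib.sha1(raw).hexdigest())  # True
```

To inspect a real object, pass the bytes of a loose object file to `parse_loose_object`; the returned hash should match the name under which the object is stored.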
I think that the best way to understand the content of each type of Git object is to explore them yourself.
You could do it easily using the command:
Start with the SHA-1 of a commit. The output contains the SHA-1 of a tree; take it and apply the same command again, and keep going until you reach a blob.
Each time, you will see the content that is stored for that object in the Git database.
The only other thing that you should know is that the content is prefixed by the type of the object and the length of the content, and then compressed.