How is the hash calculated for commits vs. trees vs. blobs?

Posted 2019-05-22 13:33

Question:

I am confused as to how SHA-1 hashes are calculated for commits, trees, and blobs. As per this article, commit hashes are calculated based on the following factors:

  1. The source tree of the commit (which unravels to all the subtrees and blobs)
  2. The parent commit sha1
  3. The author info
  4. The committer info (right, those are different!)
  5. The commit message

Are the same factors involved for tree and blob hashes as well?

Answer 1:

Git is sometimes called a "content-addressable filesystem". The hashes are the addresses, and they are based on the contents of the various objects. So, in order to know what the hash is based on, we only need to know the contents of the various objects.

Blobs

A blob is simply a stream of octets. Nothing more. It is akin to the concept of file content in a Unix filesystem.

So, the hash of a blob is based solely on its contents. A blob has no metadata: no name, no permissions, no timestamps.

Trees

A tree associates names and permissions with other objects (blobs or trees). A tree is simply a list of quadruples (permission, type, hash, name). For example, a tree may look like this:

100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README
100644 blob 8f94139338f9404f26296befa88755fc2598c289 Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 lib

Note the third entry which is itself a tree.

A tree is analogous to a directory special file in a Unix filesystem.

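Again, the hash is based on the contents of the tree, which means on the names, permissions, types, and hashes of its entries. Because each entry carries the hash of the object it points to, a change anywhere below a tree changes that tree's hash as well.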
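
To make this concrete, here is a minimal Python sketch of how a tree could be serialized and hashed. It assumes the raw layout Git uses for trees: each entry is the mode and name as text, a NUL octet, then the hash as 20 raw bytes, with the type implied by the mode rather than stored separately. The type-and-length header it prepends is the one described under Storage Format. The function names are made up for illustration and are not part of Git.

```python
import hashlib


def serialize_tree(entries):
    """Serialize (mode, name, hex_hash) tuples into the raw body of a tree object."""
    # Modes are stored without leading zeros, so "040000" becomes "40000".
    normalized = [(mode.lstrip("0"), name, hex_hash) for mode, name, hex_hash in entries]

    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name; a subtree compares as if its name ended in "/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_hash in sorted(normalized, key=sort_key):
        # "<mode> <name>" as text, a NUL octet, then the hash as 20 raw bytes.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_hash)
    return body


def tree_hash(entries):
    """Hash the tree body together with its "tree <length>" header."""
    body = serialize_tree(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The example tree from above, as (mode, name, hash) tuples.
example = [
    ("100644", "README", "a906cb2a4a904a152e80877d4088654daad0c859"),
    ("100644", "Rakefile", "8f94139338f9404f26296befa88755fc2598c289"),
    ("040000", "lib", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
]
print(tree_hash(example))
```

Renaming README, changing its mode, or changing the blob it points to would each produce a different tree hash.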

Commits

A commit records a snapshot of a tree in time together with some metadata and how the snapshot came to be. A commit consists of:

  • a list of hashes of (any number of) parent commits (including zero)
  • a hash of a tree
  • a commit message
  • commit metadata (committer name, e-mail address, and commit date with time zone)
  • authoring metadata (author name, e-mail address, and authoring date with time zone)

The hash of a commit is based on those.
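
As a rough illustration, the following Python sketch builds the textual body of a plain commit (no signature or extra headers) and hashes it with the type-and-length header described under Storage Format. The identity string and the function names are placeholders invented for this example.

```python
import hashlib

# Hash of the empty tree; any tree hash would do here.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
# Hypothetical identity: "Name <email> <unix timestamp> <time zone>".
WHO = "A U Thor <author@example.com> 1558532000 +0000"


def serialize_commit(tree, parents, author, committer, message):
    """Build the body of a commit object: header lines, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def commit_hash(tree, parents, author, committer, message):
    """Hash the commit body together with its "commit <length>" header."""
    body = serialize_commit(tree, parents, author, committer, message)
    header = f"commit {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


print(commit_hash(EMPTY_TREE, [], WHO, WHO, "Initial commit\n"))
```

Since the committer date is part of the hashed text, committing the same tree with the same message one second later already yields a different hash.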

Tags

Lightweight tags aren't objects in the sense above. They are not part of the object store and don't have a hash of their own. They are references to objects. (Note: any object can be tagged, not just commits, although that is the normal use case.)

Annotated Tags

An annotated tag is different: it is part of the object store.

An annotated tag stores:

  • the hash and type of the tagged object (usually a commit)
  • the tag name
  • a tag message
  • tagging metadata (tagger name, e-mail address, and tagging date)

As with all other objects, the hash is calculated based on all of them and nothing more.
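
A small Python sketch of the same idea for an unsigned annotated tag follows. It assumes the usual textual layout (object, type, tag, and tagger lines, a blank line, then the message); the tag name, identity string, and function names are invented for illustration.

```python
import hashlib

# Hypothetical tagger identity: "Name <email> <unix timestamp> <time zone>".
TAGGER = "A U Thor <author@example.com> 1558532000 +0000"


def serialize_tag(target, target_type, name, tagger, message):
    """Build the body of an annotated tag object."""
    lines = [
        f"object {target}",     # hash of the tagged object
        f"type {target_type}",  # usually "commit"
        f"tag {name}",
        f"tagger {tagger}",
    ]
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def tag_hash(target, target_type, name, tagger, message):
    """Hash the tag body together with its "tag <length>" header."""
    body = serialize_tag(target, target_type, name, tagger, message)
    header = f"tag {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder target hash, used only to have something to point at.
target = "0" * 40
print(tag_hash(target, "commit", "v1.0", TAGGER, "First release\n"))
```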

Signed tags

A signed tag is like an annotated tag, but adds a cryptographic signature. The signature is appended to the tag message, so it is part of the content that is hashed.

Notes

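Notes allow you to attach arbitrary text to an arbitrary Git object, most commonly a commit, without changing that object.

The storage of notes is a little more complicated. The text of a note is an ordinary blob. Git keeps a special ref for notes (by default refs/notes/commits) that points to a commit; the tree of that commit contains one entry per annotated object, named after the hash of that object and pointing to the blob with the note text. That tree is where the association between a note and its "annotee object" happens.

So notes are built entirely from ordinary blobs, trees, and commits, and the association happens outside the annotated object. Their hashes are calculated just like those of any other blob, tree, or commit, and adding a note never changes the hash of the object it annotates.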


Storage Format

The storage format contains a simple header. The content that is actually stored (and hashed) is the header followed by a NUL octet followed by the object contents.

The header contains the type and the length of the object contents, encoded in ASCII. So, the blob which contains the string Hello, World encoded in ASCII would look like this:

blob 12\0Hello, World

And that is what is hashed and stored.
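
Here is a minimal Python sketch of that calculation, assuming SHA-1 as in the question. The function name is made up for this example.

```python
import hashlib


def git_object_hash(obj_type, content):
    """Return the SHA-1 that Git assigns to an object with this type and content."""
    # Header: type, a space, the content length as decimal ASCII digits.
    header = f"{obj_type} {len(content)}".encode("ascii")
    # Header, NUL octet, then the raw contents are hashed as one stream.
    return hashlib.sha1(header + b"\0" + content).hexdigest()


print(git_object_hash("blob", b"Hello, World"))
```

The result should match what git hash-object --stdin prints when it is fed exactly the same 12 octets (no trailing newline).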

Other types of objects have a more structured format, so a tree object would start off with a header tree <length of content in octets>\0 followed by a strictly defined, structured, serialized representation of a tree.

The same for commits, and so on.

Most formats are textual formats, based on simple ASCII. For example, the size is not encoded as a binary integer, but as a decimal integer with each digit represented as the corresponding ASCII character.

Compression

After the hash is computed, the octet stream corresponding to the object, including the header, is compressed using zlib-deflate, and the resulting octet stream is stored in a file whose path is derived from the hash; by default in the directory

.git/objects/<first two characters of the hash>/<remaining hash>
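
The following Python sketch puts the two steps together: hash first, then compress, then derive the path below .git/objects from the hash. It does not touch the filesystem, and the function name is invented for this example. The exact compressed bytes may differ from what Git writes (for instance because of the compression level), but that does not matter for the hash, which is computed before compression.

```python
import hashlib
import zlib


def loose_object(obj_type, content):
    """Return (relative_path, compressed_bytes) for a loose object."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    digest = hashlib.sha1(store).hexdigest()  # hashed before compression
    # First two hex characters form the directory, the remaining 38 the file name.
    relative_path = f"{digest[:2]}/{digest[2:]}"
    return relative_path, zlib.compress(store)


path, data = loose_object("blob", b"Hello, World")
print(path)
```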

Packs

The above storage format is called the loose object format, because every object is stored individually. There is a more efficient storage format (which is also used as the network transmission format), called a packfile.

Packfiles are an important speed and storage optimization, but they are rather complex, so I am not going to describe them in detail.

As a first approximation, a packfile consists of many objects concatenated into a single file, plus a second file which contains an index of where in the packfile each object resides. Objects that resemble one another are not all stored in full: one is stored whole and the others as deltas against it. This allows a better compression ratio, since it exploits redundancies between objects and not just within a single object. (E.g. if you have two revisions of a blob which are almost identical … which is kind of the norm in an SCM.)

So a packfile doesn't rely on zlib-deflate alone: it first applies a binary delta compression algorithm, and then each entry (whether a full object or a delta) is zlib-deflated individually. It also uses certain heuristics for ordering the candidate objects so that objects which are likely to have large similarity end up close together. (The delta algorithm does not compare every object against every other one, as that would consume too much time and memory; rather, it operates on a sliding window over the sorted list of objects, and the heuristics try to ensure that similar objects land within the same window.) Some of those heuristics are: look at the names a tree associates with blobs and try to keep the ones with the same names close together, try to keep the ones with the same file extension close together, try to keep subsequent revisions close together and so on.

Poking around

Loose (i.e. not packed) objects are just zlib-deflated. Un-deflate them and look at them to see how they are structured. Note that the uncompressed octet stream is exactly what is being hashed; the objects are stored compressed, but they are hashed before they are compressed.

Here's a simple Perl one-liner to un-deflate (is that inflate?) a stream:

perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
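
For comparison, here is a rough Python equivalent that also splits the header from the contents. It is only a sketch: the function name is made up, and it expects the path of a loose object file as the first command-line argument.

```python
import sys
import zlib


def parse_loose_object(compressed):
    """Inflate a loose object and return (type, contents)."""
    raw = zlib.decompress(compressed)
    # The header ends at the first NUL octet; everything after it is the contents.
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("length in header does not match contents")
    return obj_type.decode("ascii"), body


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inflate.py .git/objects/<first two characters>/<remaining hash>
    with open(sys.argv[1], "rb") as f:
        kind, contents = parse_loose_object(f.read())
    print(kind, len(contents))
    sys.stdout.buffer.write(contents)
```

Trees will look partly garbled on a terminal, because the hashes inside a tree are stored as raw bytes rather than as hex text.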


Answer 2:

I think that the best way to understand the content of each type of Git object is to explore them yourself.

You can do that easily using the command:

git cat-file -p <a_sha1>

Start with the sha1 of a commit. The output includes the sha1 of a tree; take it and apply the same command again, and keep going until you end up at a blob.

Each time, you will see the content that is stored for that object in the Git database.

The only other thing you should know is that the content is prefixed with the type of the object and the length of the content, and the result is then compressed.



Tags: git hash