Git repository internal format explained

Posted 2020-05-23 06:57

Question:

Is there any documentation on how Git stores files in its repository? I've tried searching the Internet, but found no usable results. Maybe I'm using the wrong query, or maybe the Git repository internal format is a great secret?

Let me explain why I need this rocket-science information: I'm using C# to get file history from a repository, but the libgit2sharp library does not currently implement this. So (as a responsible person ;) I need to implement this feature myself and contribute it to the community.

But since the kernel sources moved to GitHub, I don't even know where to start my search.

Many thanks in advance!

Answer 1:

The internal format of the repository is extremely simple. Git is, in essence, a content-addressable file system that lives in user space.

Here's a thumbnail sketch.

Objects

Git stores its internal data structures as objects. There are four kinds of objects: blobs (sort of like files), trees (sort of like directories), commits (snapshots of the file system at particular points in time, along with the parent commits that led there) and tags (pointers to commits, useful for marking important ones).
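To make the blob/tree relationship a little more concrete, here is a minimal Python sketch of how a tree object's payload can be decoded. It assumes the commonly documented layout for SHA-1 repositories (an ASCII mode, a space, the entry name, a NUL byte, then the 20 raw bytes of the entry's object ID) and that you already have the decompressed payload in hand. parse_tree is a name made up for illustration; it is not part of Git or libgit2.

```python
def parse_tree(payload: bytes):
    """Yield (mode, name, oid_hex) for each entry in a tree object's payload.

    Assumes SHA-1 object IDs, i.e. 20 raw bytes per entry.
    """
    pos = 0
    while pos < len(payload):
        space = payload.index(b" ", pos)    # end of the ASCII mode
        nul = payload.index(b"\0", space)   # end of the entry name
        mode = payload[pos:space].decode("ascii")
        name = payload[space + 1:nul].decode("utf-8", errors="replace")
        oid = payload[nul + 1:nul + 21].hex()  # 20 raw bytes as 40 hex digits
        yield mode, name, oid
        pos = nul + 21
```

An entry whose mode is 40000 points at another tree (a subdirectory); regular files usually show up as 100644 or 100755 and point at blobs.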

If you look inside the .git directory of a repository, you'll find an objects directory containing files named after the SHA-1 hash of the object each one holds. You can inspect them using the plumbing command git cat-file. Here is an example commit object from one of my repositories:

noufal@sanitarium% git cat-file -p 7347addd901afc7d237a3e9c9512c9b0d05c6cf7
tree c45d8922787a3f801c0253b1644ef6933d79fd4a
parent 4ee56fbe52912d3b21b3577b4a82849045e9ff3f
author Noufal Ibrahim <noufal@..> 1322165467 +0530
committer Noufal Ibrahim <noufal@..> 1322165467 +0530

Added a .md extension to README

You can also see the object itself at .git/objects/73/47addd901afc7d237a3e9c9512c9b0d05c6cf7.
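If you want to read such a file yourself rather than going through git cat-file, the following Python sketch shows the idea. It assumes a SHA-1 repository and the usual loose-object encoding: the file is zlib-compressed, and the decompressed bytes are a header of the form "<type> <size>" followed by a NUL byte and then the payload. The object ID is the SHA-1 of that header plus payload. The function names are mine, chosen for illustration.

```python
import hashlib
import os
import zlib


def object_id(kind: bytes, payload: bytes) -> str:
    """Return the SHA-1 object ID for a payload of the given type."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(git_dir: str, oid: str) -> str:
    """First two hex digits name the directory, the remaining 38 the file."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def read_loose_object(git_dir: str, oid: str):
    """Return (kind, payload) for a loose object.

    Raises FileNotFoundError if the object is not stored loose
    (for example because it lives in a packfile).
    """
    with open(loose_object_path(git_dir, oid), "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, payload = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return kind.decode("ascii"), payload
```

Calling read_loose_object(".git", "7347addd901afc7d237a3e9c9512c9b0d05c6cf7") in the repository above should hand back the type "commit" and the same text that git cat-file -p printed, as long as the object has not been packed yet.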

You can examine other objects like this. Each commit points to a tree representing the file system at that point in time and has one parent (or more, in the case of merge commits); the very first commit has none.
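As a rough illustration of that structure, here is a small Python sketch that pulls the tree and parent IDs out of a commit payload like the one shown earlier. It assumes the payload is header lines, a blank line, then the message, and it simply skips continuation lines (those starting with a space, as used by multi-line headers such as signatures). parse_commit is a made-up helper name.

```python
def parse_commit(payload: bytes) -> dict:
    """Split a commit payload into its tree, parents, identities and message."""
    text = payload.decode("utf-8", errors="replace")
    head, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in head.split("\n"):
        if line.startswith(" "):
            continue  # continuation of a multi-line header; ignored here
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)  # merge commits have several
        elif key in ("author", "committer"):
            commit[key] = value
    return commit


# The example commit from above, reproduced as raw bytes.
SAMPLE = (
    b"tree c45d8922787a3f801c0253b1644ef6933d79fd4a\n"
    b"parent 4ee56fbe52912d3b21b3577b4a82849045e9ff3f\n"
    b"author Noufal Ibrahim <noufal@..> 1322165467 +0530\n"
    b"committer Noufal Ibrahim <noufal@..> 1322165467 +0530\n"
    b"\n"
    b"Added a .md extension to README\n"
)

parsed = parse_commit(SAMPLE)
print(parsed["tree"], parsed["parents"])
```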

Objects are initially stored as single files in the objects directory. These are called loose objects. When you run git gc, objects that can no longer be reached are pruned and the remaining ones are packed together into a single file and delta compressed. This is more space efficient and compacts the repository. After you run gc, you can look at the .git/objects/pack/ directory to see Git's packfiles. To unpack them, you can use the plumbing command git unpack-objects. The .git/objects/info/packs file contains a list of the packfiles that are currently present.
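Decoding a whole packfile is more involved than reading a loose object, but the start of the file is easy to check. The Python sketch below reads only the 12-byte header, assuming the documented layout: the 4-byte signature "PACK", then a version number and an object count, each a 4-byte big-endian integer. It does not attempt to decode the entries or their deltas, and read_pack_header is a hypothetical name.

```python
import struct


def read_pack_header(path: str):
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    if len(header) != 12:
        raise ValueError("file is too short to be a packfile")
    signature, version, count = struct.unpack(">4sII", header)
    if signature != b"PACK":
        raise ValueError("missing PACK signature")
    return version, count
```

Pointing this at one of the .pack files under .git/objects/pack/ gives a quick sanity check before you hand the file to a real parser.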

References

The next thing you need to know is what references are. These are pointers to certain commits or other objects. Your branches and other such things are implemented as references. There are two kinds: "real" references (which are like hard links in a file system) and "symbolic" references (which are pointers to real references, like symbolic links).

These are located in the .git/refs directory (git gc may also move some of them into a single .git/packed-refs file). For example, in the above repository, I'm on the master branch. My latest commit is:

noufal@sanitarium% git log -1
commit 7347addd901afc7d237a3e9c9512c9b0d05c6cf7
Author: Noufal Ibrahim <noufal@...>
Date:   Fri Nov 25 01:41:07 2011 +0530

    Added a .md extension to README

You can see that my master reference located at .git/refs/heads/master points to this commit.

noufal@sanitarium% more .git/refs/heads/master
7347addd901afc7d237a3e9c9512c9b0d05c6cf7

The current branch is stored in the symbolic reference HEAD, located at .git/HEAD. Here it is:

noufal@sanitarium% more .git/HEAD
ref: refs/heads/master

It will change if you switch branches.
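Following HEAD down to a commit ID by hand looks roughly like the Python sketch below. It only handles references stored as individual files, as in the listings above; references that have been packed into .git/packed-refs are not looked up, so treat it as an illustration rather than a complete resolver. resolve_ref and its max_depth parameter are my own inventions.

```python
import os


def resolve_ref(git_dir: str, name: str = "HEAD", max_depth: int = 5) -> str:
    """Follow symbolic references until a file holding an object ID is found.

    Only loose reference files are consulted; packed-refs is ignored.
    """
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name), "r", encoding="ascii") as fh:
            content = fh.read().strip()
        if content.startswith("ref: "):
            name = content[len("ref: "):]  # symbolic: points at another ref
            continue
        return content  # "real" reference: the object ID itself
    raise ValueError("too many levels of symbolic references")
```

In the repository above, resolve_ref(".git") would read .git/HEAD, follow it to refs/heads/master, and return the commit ID stored there. If HEAD holds a commit ID directly (a detached HEAD), that ID is returned as-is.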

Tags are references like this too, stored under .git/refs/tags, but unlike branches they do not move when you make new commits.

The entire repository is managed using just a DAG of commits (each of which points to a tree representing the files at a point in time) and references that point to various commits on the DAG so that you can manipulate them.
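Since the question was about file history, here is a deliberately simplified Python sketch of what walking that DAG could look like. Everything in it is an assumption made for illustration: the commits are held in an in-memory dictionary keyed by made-up short IDs, each with a list of parents and a flattened mapping of path to blob ID, and a commit counts as touching a file when its blob for that path differs from the blob in every one of its parents. Git's own history simplification is more subtle than this, and renames are not followed.

```python
def walk_history(commits: dict, start: str):
    """Yield every commit reachable from start exactly once (depth-first)."""
    seen = set()
    stack = [start]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue  # already reached through another parent
        seen.add(oid)
        yield oid
        stack.extend(commits[oid]["parents"])


def commits_touching(commits: dict, start: str, path: str) -> list:
    """List reachable commits whose blob for path differs from every parent's."""
    touched = []
    for oid in walk_history(commits, start):
        mine = commits[oid]["files"].get(path)
        parents = commits[oid]["parents"]
        theirs = [commits[p]["files"].get(path) for p in parents]
        # A root commit counts only if it actually contains the file.
        if all(mine != t for t in theirs) and (parents or mine is not None):
            touched.append(oid)
    return touched


# Hypothetical three-commit history; IDs are placeholders, not real hashes.
HISTORY = {
    "c3": {"parents": ["c2"], "files": {"README.md": "b2", "a.txt": "x1"}},
    "c2": {"parents": ["c1"], "files": {"README.md": "b2"}},
    "c1": {"parents": [], "files": {"README": "b1"}},
}

print(commits_touching(HISTORY, "c3", "README.md"))
```

In a real implementation the files mapping would come from reading each commit's tree object recursively, and comparing tree IDs first lets you skip whole directories that did not change.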

Further reading

  • I have a presentation which I use for my git trainings up here that explains some of this.
  • The community book at http://book.git-scm.com/ has some sections on the internals.
  • Scott Chacon's Pro Git book has a section on internals.
  • He also has a PeepCode PDF just about the internals.