How to know working directory refers to which branch

Posted 2019-09-20 20:22

Question:

After some research, I figured out that Git keeps two versions of the code in two places:

  • .git/refs/heads (local repository)
  • .git/refs/remotes/ (working directory)

First of all, is my understanding ok?

Then I need to know both head and working directory are referring to which branch. There are two commands:

  1. cat .git/head

  2. git branch

Can you please tell me which one those two commands refer to (the branch that is in the head, or the branch that is in the working directory)?

And when you run git status, will your changes be compared with the version which is in the head or working directory?

Answer 1:

No, your understanding is not OK: the entities in .git/refs/ are references, not work-trees.

(I'm afraid this is going to be long, since your question is kind of unfocused.)

Definitions of terms

Git used to use the term working directory interchangeably with the term work-tree or working tree. This began to change a few years ago due to confusion with the operating-system definition of working directory, where you cd path and run pwd to print your working directory. I will stick with the term work-tree.

A Git repository comes, by default, with one (1) work-tree. The work-tree is located in the same directory as the .git sub-directory that contains the repository, so if you have a .git directory you are in the top level of this particular Git work-tree.

The primary function of a Git repository (at least as far as Git itself is concerned) is to hold commits. A commit is a complete snapshot of some set of files. The files stored in the commit are said to be versioned, and using git checkout commit-specifier to extract a commit extracts that particular version of all of the files that are in the commit, into the work-tree. Files that are stored in commits in this way are said to be version-controlled. The commit that you have checked out—extracted from the repository, into the work-tree—is the current commit.

Note that files stored in commits are in a special, compressed (often highly compressed), read-only format. They cannot be changed. Once Git stores something as a Git-format object [1] inside the repository database, that thing can never be changed. It can be copied (extracted into text format, manipulated in some way, and then stored as a new and different version), but each Git object is recorded by its apparently-random hash ID, which is actually a cryptographic checksum of its data. By changing something about an internal Git object and storing it again, you simply create a new object, with a new and different hash ID. [2]
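To make the "hash ID is a checksum of the data" point concrete, here is a minimal Python sketch of how a file's object ID can be computed. It assumes the default SHA-1 object format, in which a file (a blob object) is hashed as the header "blob <size>" plus a NUL byte, followed by the raw content. The function name git_blob_id is made up for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID of a blob holding `content`.

    Assumes the SHA-1 object format: the hashed data is the header
    b"blob <size>\\0" followed by the raw file content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = git_blob_id(b"hello\n")
modified = git_blob_id(b"hello!\n")

# Changing the content yields a different ID, i.e. a new object;
# the object stored under the old ID is untouched.
print(original)
print(modified)

# Identical content always hashes to the identical ID, which is why
# an unchanged file is shared between commits.
print(original == git_blob_id(b"hello\n"))
```

If Git is installed, the result for a given file should match what git hash-object prints for that file, as long as the repository uses SHA-1.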

Git allows the work-tree to contain additional files that are not version-controlled. Such files are untracked, in Git's terminology; by logical inversion, version-controlled files must be tracked. Files that are untracked can also be ignored (files that are tracked cannot be ignored).

A Git repository also comes with one (1) index. Git's index is a key data structure that you must be aware of at all times when using Git with a work-tree. This index is so important, or so badly named originally, that it has two additional names: it is also called the staging area, and sometimes the cache. The reason it was originally called the index is that, internally, it indexes (or caches) the work-tree. That's not the important-to-you aspect of the index, though.

The main thing the index does for you is to hold a version of each file. That is, when you extracted the current commit into the work-tree, the first thing Git really did was to copy each file from the commit, into the index. (This copy is ridiculously fast and light-weight since the index holds the files in the same compressed, Git-only format and in this case actually shares the underlying Git object.) Once the file is in the index, Git can and does extract it into the work-tree, expanding it out to its normal format, so that non-Git programs—and you yourself—can view and/or modify it.

The key difference between the index copy of a file, and the current commit's copy of that same file, is that the index version can be overwritten. In addition, the fact that the file is in the index is how Git decides that the file is tracked. This gives us the true definition of a tracked file: a file is tracked if and only if it exists in the index. [3]

Git has multiple different things called head: there are branch heads, which are, I think, more properly just called branch names (though Git uses the terms interchangeably), and there is a special distinguished HEAD, which should be typed in all-capitals like this. It is actually stored in a file named .git/HEAD. The (single) HEAD is always a symbolic name for the current commit. Items in .git/refs/heads/ include (but are not necessarily all of) the branch names, and items in .git/refs/remotes/ include (but are not necessarily all of) what Git calls, variously, remote-tracking branch names, remote-tracking branches, and several other misleading terms. I prefer to call these remote-tracking names. We'll have a bit more on this in a moment, but for now, remember that each of these is an instance of what Git calls a reference.


[1] Git has four types of objects in the main repository database: commit, tree, blob, and annotated tag. You normally don't need to care about anything but commit here. The database itself is simply a key-value store, with the keys being the hash IDs and the values being the object's data. Git can (and does) detect object corruption whenever it extracts an object by its key, because the cryptographic hash of the data must match the key.

[2] Note that each commit has its own distinct, unique hash ID. This is not true of files: if a file's content is the same in multiple commits, that file's internal object inside Git has the same hash ID in all of those commits. This is because the hash ID is the hash of the contents. Every commit is unique in some way: they all have time stamps, for instance, so as long as your computer's clock works and you don't make more than one commit per second, every commit has a unique time stamp. Actually, there are two time stamps, one for "author" and one for "committer", but we need not worry about this yet.

[3] When you run git commit, Git makes the new commit using whatever is in the index / staging-area at that moment. Because the files in the index are already in the Git-only format, this makes committing very fast. In a large project with tens of thousands of files, re-compressing every tracked work-tree file takes far too long (sometimes many seconds), but using the already-compressed data from the index takes only a few milliseconds. This is why the index exists; and given that the index exists and git commit simply packages it up, this is why the presence of a file in the index is what makes the file tracked.


Branch names, references, and what it means to be "on a branch"

As we just saw, a branch name is a specific kind of reference. A Git reference contains exactly one (1) hash ID. This is true of branch names, remote-tracking names, and all other kinds of references (tag names, replacement entries from git replace, and so on).

Branch names are simply references whose full name starts with refs/heads/. This is why some of them are found in .git/refs/heads. However, branch names are intended to be case-sensitive: the branch xyzzy is different from the branch Xyzzy which is different from xyZZy, and so on. This is currently partially broken on Windows and macOS systems whose file systems fold case, although references are sometimes stored in a single file (.git/packed-refs), in which case it works on Windows and macOS too. [4] In the future, references are likely to be stored in a real database (adjacent to the repository object database) which will make them case-sensitive and make it impossible to find them by poking around in .git/refs/heads/.

Remote-tracking names are references whose full name starts with refs/remotes/. Their name goes on to include one more element, typically origin/. Your Git uses remote-tracking names to remember branch names, and their single corresponding hash ID, that your Git found in some other Git repository. While they are in fact your names, and you are free to change them however you like, it's best to just let your Git update them automatically to match the branch names stored in that other Git.

Note that Git typically strips off the refs/whatever/ part, so that a branch name whose full name is refs/heads/master is just referred-to as master. This is crucial in the next step.

In Git's terminology, you can either be on a branch or have a detached HEAD. When you run git checkout branchname, Git populates the index and work-tree from the commit whose hash ID is stored in the branch name branchname. It also copies that branch name into HEAD, so that .git/HEAD contains the literal string ref: refs/heads/branchname. At this point Git will claim that you are on that branch. On the other hand, if you run git checkout hash-id, Git populates the index and work-tree from the commit whose hash ID you gave—this needs to be a valid, existing commit—and then writes the hash ID itself into .git/HEAD. At this point Git will claim that you have a detached HEAD.

Hence, if your HEAD file contains a raw hash ID—is detached—Git can read .git/HEAD to get the current commit's hash ID. If your HEAD is attached, Git can read .git/HEAD to get the branch name, and then read the branch name—which may or may not be stored in .git/refs/heads/name, but is definitely stored somewhere—to get the current commit hash ID. Either way, Git can use your HEAD to find the current commit.
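The lookup just described can be sketched in a few lines of Python. This is only an illustration of the two cases (attached and detached), not how Git itself is implemented. It assumes the classic file-based layout, with loose reference files plus an optional .git/packed-refs file, and it builds a fake .git directory in a temporary folder so that it runs without Git. The function names read_ref and resolve_head are made up for this example.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def read_ref(git_dir: Path, ref_name: str) -> Optional[str]:
    """Look up a reference: loose file first, then .git/packed-refs."""
    loose = git_dir / ref_name
    if loose.is_file():
        return loose.read_text().strip()
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip blank lines, the "#" header, and "^" peeled-tag lines.
            if not line or line[0] in "#^":
                continue
            hash_id, _, name = line.partition(" ")
            if name == ref_name:
                return hash_id
    return None  # e.g. a brand-new branch that has no commits yet


def resolve_head(git_dir: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (full branch name or None, current commit hash or None)."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Attached HEAD: it holds a branch name, which holds the hash.
        ref_name = head[len("ref: "):]
        return ref_name, read_ref(git_dir, ref_name)
    # Detached HEAD: the file holds the raw commit hash itself.
    return None, head


def demo() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp) / ".git"
        (git_dir / "refs" / "heads").mkdir(parents=True)
        fake_hash = "1" * 40  # placeholder, not a real commit
        (git_dir / "refs" / "heads" / "master").write_text(fake_hash + "\n")

        (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
        print(resolve_head(git_dir))  # attached: branch name and hash

        (git_dir / "HEAD").write_text(fake_hash + "\n")
        print(resolve_head(git_dir))  # detached: no branch, just the hash


if __name__ == "__main__":
    demo()
```

In real scripts it is better to ask Git itself (for instance with git rev-parse HEAD) than to read these files directly, since the storage details are not guaranteed to stay this way.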

Whenever you make a new commit, Git:

  • Creates the commit object. Git essentially freezes all the file versions in the index to use as the snapshot. It builds a new commit with those files, with you as author and committer, with "now" as the time stamps. It uses your commit message for the commit log. And, crucially, it uses the current commit ID as the parent commit for the new commit.

  • Now that the commit exists and has its own unique hash ID, Git updates the current branch, if your HEAD is attached. That is, if .git/HEAD has a branch name in it, Git overwrites the branch name's stored hash ID with the new one. If your HEAD is detached, Git overwrites .git/HEAD with the new hash ID. Either way, HEAD continues to name the current commit, because the new commit is now the current commit.

Note that because the new commit contains exactly those files that were in the index / staging-area, once you commit, the index and the HEAD commit match, just as they did when you first ran git checkout to check out the previous commit. The work-tree does not enter this picture at all!
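The commit steps above can be modelled with a toy in-memory repository. This is a deliberately simplified sketch: the ToyRepo class and its fields are invented for illustration, file contents are plain strings, and the commit IDs are SHA-1 hashes of a made-up representation (real commit hashes also cover author, committer, and time stamps). What it does show accurately is which thing gets overwritten in the attached and detached cases.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyRepo:
    head: str = "ref: refs/heads/master"                 # like .git/HEAD
    refs: Dict[str, str] = field(default_factory=dict)   # name -> hash
    index: Dict[str, str] = field(default_factory=dict)  # path -> content
    commits: Dict[str, dict] = field(default_factory=dict)

    def current_commit(self) -> Optional[str]:
        if self.head.startswith("ref: "):
            return self.refs.get(self.head[len("ref: "):])
        return self.head

    def commit(self, message: str) -> str:
        parent = self.current_commit()
        snapshot = dict(self.index)  # freeze whatever is in the index
        # Toy ID only: not the real Git commit hash computation.
        raw = repr((sorted(snapshot.items()), parent, message))
        new_id = hashlib.sha1(raw.encode("utf-8")).hexdigest()
        self.commits[new_id] = {
            "files": snapshot, "parent": parent, "message": message,
        }
        if self.head.startswith("ref: "):
            # Attached: overwrite the hash stored in the branch name.
            self.refs[self.head[len("ref: "):]] = new_id
        else:
            # Detached: overwrite HEAD itself.
            self.head = new_id
        return new_id


repo = ToyRepo()
repo.index["README"] = "v1"
first = repo.commit("initial")
repo.index["README"] = "v2"
second = repo.commit("update")

print(repo.head)                               # still names the branch
print(repo.refs["refs/heads/master"] == second)
print(repo.commits[second]["parent"] == first)
print(repo.commits[second]["files"] == repo.index)  # index matches HEAD
```

Note that there is no work-tree anywhere in this model, and the commit operation never needed one.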


[4] Because Windows and macOS cannot have two different files named MASTER and master, for instance, it's unwise to make branch names that differ only in case. For the same reason, it's a bad idea to have commits containing files whose names differ only in case (as was true in older Linux kernels, for instance). When you check out such a commit, your index / staging-area gets both files, e.g., README.TXT and readme.txt, but your work-tree can only hold one of them and it becomes too difficult to work with Git.


On to your questions

Then I need to know both head and working directory are referring to which branch.

There are two commands:

  1. cat .git/head
  2. git branch

As I mentioned above, the file .git/HEAD contains either the branch name (if your HEAD is attached) or a raw commit hash (if your HEAD is detached). So cat .git/HEAD—you should use all uppercase so that this will work on other systems—will tell you which branch you are on, if you are on a branch.

The git branch command by default lists your branch names (all your .git/refs/heads/ files plus any branch names that are stored elsewhere) and adds a prefix * in front of the one that is in .git/HEAD if you are on a branch. If you have a detached HEAD, git branch will include in its output listing the string * HEAD detached at ... or * HEAD detached from .... The precise details vary from one version of Git to another.

There are several more commands aimed at writing code that uses Git: git symbolic-ref HEAD will read the branch name to which HEAD is attached, and print it, or simply fail if HEAD is detached; and git rev-parse --symbolic-full-name HEAD will print the full name, e.g., refs/heads/master, if you are on a branch, or just print HEAD if HEAD is detached. Using git rev-parse --abbrev-ref HEAD you can get the short name of the branch (refs/heads/ stripped), or again HEAD if HEAD is detached.

And when you run git status, will your changes be compared with the version which is in the head or working directory?

This particular question cannot be answered the way it was asked. What git status does is to run two comparisons—two git diff --name-statuses, more or less:

  • What, if anything, is different between the HEAD commit and the index?
  • What, if anything, is different between the index and the work-tree?

The results of the first diff are changes that are staged for commit—if you commit now, using the current index, the new snapshot will differ from the old one. The results of the second diff are changes that are not staged for commit. You can use git add to copy the work-tree files over the index ones, so that the index version matches the work-tree version.

Remember, whatever is in the index is, in effect, the proposed commit. Updating the index / staging-area copies of each file changes what you are proposing to commit next.
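The two comparisons that git status runs can be sketched with a toy model in which the HEAD commit, the index, and the work-tree are each just a mapping from path to content. This is an illustration under those assumptions, not Git's real algorithm (which also handles ignored files, renames, and various speed-ups). The names diff_name_status and toy_status are made up here, and the single-letter codes only imitate the A/M/D letters of git diff --name-status.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, str]  # path -> file content


def diff_name_status(old: Snapshot, new: Snapshot) -> List[Tuple[str, str]]:
    """List (status, path) pairs: A = added, D = deleted, M = modified."""
    changes = []
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes.append(("A", path))
        elif path not in new:
            changes.append(("D", path))
        elif old[path] != new[path]:
            changes.append(("M", path))
    return changes


def toy_status(head: Snapshot, index: Snapshot, work_tree: Snapshot) -> dict:
    # Comparison 1: HEAD commit vs index -> "staged for commit".
    staged = diff_name_status(head, index)
    # Comparison 2: index vs work-tree, tracked files only
    # -> "not staged for commit".
    tracked = {p: c for p, c in work_tree.items() if p in index}
    unstaged = diff_name_status(index, tracked)
    # Work-tree files that are not in the index are untracked.
    untracked = sorted(p for p in work_tree if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


head = {"a.txt": "1", "b.txt": "1"}
index = {"a.txt": "2", "b.txt": "1"}          # a.txt was changed and added
work_tree = {"a.txt": "2", "b.txt": "3", "notes.txt": "x"}

print(toy_status(head, index, work_tree))
```

In this example a.txt shows up as staged, b.txt as not staged, and notes.txt as untracked. Copying work_tree["b.txt"] into index["b.txt"], which is roughly what git add does, would move b.txt from the second list to the first.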