Git Merging selected branches into a new branch, r

What exactly do I type to go from: (I also get a sense from other people that my drawings suggest I don't quite understand git - bear with me.)

               -<>-<>-<>-<>- (B)
             /            
-----master-            
             \         
               --<>-<>- (A)

where '<>' is a commit.

to this:

                    (merge A and B into C)

               --------------o-> (C, new 'clean' branch off master)
              /             /
             /-<>-<>-<>-<>-/ (B)
            //            /
-----master--            /
              \         /
               --<>-<>-/ (A)

where 'o' is a merge of A and B into C.

And will I then still be able to git checkout the branches (A) and (B)?

And/or could I do this:

               --------------o-<>-(C)
              /             /
             /-<>-<>-<>-<>-/-<>-<>-(B)
            //            /
-----master--            /
              \         /
               --<>-<>-/-<>-<>-<>-(A)

If you can, even in some round about way, could you explain? Thanks.

Let's back up a bit here and start with how simple, ordinary commits work in Git. First, let's define what a commit is. Commits really are pretty simple. Try, as an experiment, running:

$ git cat-file -p HEAD

This will print, on your terminal, your current commit, which will look much like this, but with different big ugly hash IDs (and of course names):

tree 142feb985388972de41ba56af8bc066f1e22ccf9
parent 62ebe03b9e8d5a6a37ea2b726d64b109aec0508c
author A U Thor <thor@example.com> 1501864272 -0700
committer A U Thor <thor@example.com> 1501864272 -0700

this is some commit

It has a commit message.

That's it—that's all you need to have a commit! There's a lot hiding away in plain sight here, though. In particular, there are the tree and parent lines, which have these big ugly hash IDs. In fact, the name HEAD acts as a stand-in for another one:

$ git rev-parse HEAD
4384e3cde2ce8ecd194202e171ae16333d241326

(again, your number will be different).

These hash IDs are the "true names" of each commit (or—as for the tree—some other Git object). These hash IDs are actually cryptographic checksums of the contents of the commit (or other object type like tree). If you know the contents—the sequence of bytes making up the object—and its type and size, you can compute this hash ID yourself, though there's no real reason to bother.
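If you do want to see this for yourself, here is a minimal sketch that hashes a small blob by hand and compares the result with what Git reports. It assumes Git's default SHA-1 object format and a sha1sum program on your PATH (neither is stated above), and the six-byte content is just an example.

```shell
# Git hashes the header "<type> <size>" plus a NUL byte, followed by the raw content.
# Here the content is the six bytes "hello\n", stored as a blob.
manual=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

# Ask Git to hash the same six bytes as a blob (nothing is written to any repository).
viagit=$(printf 'hello\n' | git hash-object --stdin)

echo "by hand: $manual"
echo "via git: $viagit"
```

The two lines should show the same 40-character ID. Commits and trees are hashed the same way, just with a different type name in the header.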

What's in a commit

As you can see from the above, a commit stores a relatively small amount of information. The actual object, this short string of text lines, goes into the Git database and gets a unique hash ID. That hash ID is its "true name": when Git wants to see what's in the commit, you give Git something that produces the ID, and Git retrieves the object itself from the Git database. Inside a commit object, we have:

A tree. This holds the source tree you saved by git adding and eventually git committing (the final git commit step writes out the tree first, then the commit).
A parent. This is the hash ID of some other commit. We'll come back to this in a moment.
An author and committer: these hold the name of the person who wrote the code (i.e., the author) and made the commit. They are separated in case someone sends you an email patch: then the other person is the author, but you are the committer. (Git was born in the days before collaboration sites like GitHub, so emailing patches was pretty common.) These store an email address and a time stamp too, with the time stamp in that odd numeric pair form.
A log message. This is just free-form text, whatever you want to provide. The only thing Git interprets here is the blank line separating the subject of the log message from the rest of the log message (and even then, only for formatting: git log --oneline vs git log, for instance).

Making commits, starting with a completely empty repository

Suppose we have a completely empty repository, with no commits in it. If we were to go to draw the commits, we'd just end up with a blank drawing, or blank whiteboard. So let's make the first commit, by git adding some files, such as a README, and running git commit.

This first commit gets some big ugly hash ID, but let's just call it "commit A", and draw it in:

A

That's the only commit. So ... what's its parent?

The answer is, it doesn't have any parent. It's the first commit, so it can't. So it doesn't have a parent line after all. This makes it a root commit.

Let's make a second commit, by making a useful file, not just a README. Then we'll git add that file and git commit. The new commit gets another big ugly hash ID, but we'll just call it B. Let's draw it in:

A <-B

If we look at B with git cat-file -p <hash for B>, we'll see that this time we have a parent line, and it shows the hash for A. We say that B "points to" A; A is B's parent.

If we make a third commit C, and look at it, we'll see that C's parent is B's hash:

A <-B <-C

So now C points to B, B points to A, and A is a root commit and doesn't point anywhere. This is how Git's commits work: each one points backwards, to its parent. The chain of backwards pointers ends when we reach the root commit.
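To watch this chain form, here is a small sketch you can run in a throwaway directory. The temporary repository, the empty commits, and the use of the log messages A, B, C as stand-ins for the real hash IDs are all my own choices for illustration.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Three commits, using the log messages A, B, C as stand-ins for the hash IDs.
for name in A B C; do
    git commit -q --allow-empty -m "$name"
done

# Print each commit's abbreviated hash, its parent's hash, and its message.
git log --format='commit %h  parent %p  (%s)'

# The raw commit object for C has a parent line; the one for A does not.
git cat-file -p HEAD   | grep -c '^parent'            # prints 1
git cat-file -p HEAD~2 | grep -c '^parent' || true    # prints 0 (grep exits non-zero)
```

The parent column for A is empty, which is what makes it a root commit.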

Now, all of these internal pointers are fixed, just like everything else about a commit. You can't change anything in any commit, ever, because its big ugly hash ID is a cryptographic checksum of the contents of that commit. If you somehow managed to change something, the cryptographic checksum would change too. You'd have a new, different commit.

Since all the internal pointers are fixed (and always point backwards), we don't really have to bother drawing them:

A--B--C

suffices. But—here's where branch names and the name HEAD come in—we need to know where to start. The hash IDs look quite random, unlike our nice simple A-B-C where we know the order of the letters. If you have two IDs like:

62ebe03b9e8d5a6a37ea2b726d64b109aec0508c
3e05c534314fd5933ff483e73f54567a20c94a69

there's no telling what order they go in, at least not from the IDs. So we need to know which is the latest commit, i.e., the tip commit of some branch like master. Then we can start at the latest commit, and work backwards, following these parent links one at a time. If we can find commit C, C will let us find B, and B will let us find A.

Branch names store hash IDs

What Git does is to store the hash ID of the tip commit of a branch in another database. Instead of using hash IDs as the keys, the keys here are the branch names, and their values are not the actual objects, but rather just the hash IDs of the tip commits.

(This "database" is, at least currently, mostly just a set of files: .git/refs/heads/master is a file holding the hash ID for master. So "updating the database" just means "writing a new hash ID into the file". But this method does not work very well on case-insensitive file systems, such as the defaults on Windows and macOS: there master and MASTER, which are supposed to be two different branches, use the same file, which causes all kinds of problems. For now, never use two branch names that differ only in case.)
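A quick way to peek at this name-to-hash mapping, sketched in a scratch repository. This assumes the files-based storage described above; Git is also allowed to keep names in .git/packed-refs, so the sketch falls back to a message if the loose file is not there.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com
git commit -q --allow-empty -m "A"

# The branch name resolves to the hash ID of its tip commit...
git rev-parse master

# ...and, while the name is stored as a loose file, that file holds the same hash.
cat .git/refs/heads/master 2>/dev/null || echo "(master is not stored as a loose file here)"
```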

So now let's look at adding a new commit D to our series of three commits. First, let's draw in the name master:

A--B--C   <-- master

The name master holds the hash ID of C at the moment, which lets us (or Git) find C, do whatever we want with it, and use C to find B. Then we use B to find A, and then since A is a root commit, we are done. We say that master points to C.

Now we add or change some files and git commit. Git writes out a new tree as usual, and then writes a new commit D. D's parent will be C:

A--B--C   <-- master
       \
        D

and finally Git just stuffs D's hash, whatever it turns out to be, into master:

A--B--C
       \
        D   <-- master

Now master points to D, so the next time we work with master we will start with commit D, then follow D's parent arrow back to C, and so on. By pointing to D, the branch-name master now has D as its tip commit. (And of course, there's no longer a reason to draw the graph with a kink in it like this.)

We keep the arrows with the branch names, because unlike commits, the branch names move. The commits themselves can never be changed, but the branch names record whatever commit we want to call "the latest".
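Here is a rough sketch of the name moving while the commits stay put. The scratch repository and the empty commits are assumptions made only to keep the example short.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com
for name in A B C; do git commit -q --allow-empty -m "$name"; done

before=$(git rev-parse master)                 # hash ID of C
git commit -q --allow-empty -m "D"
after=$(git rev-parse master)                  # hash ID of D

echo "master was:    $before"
echo "master is now: $after"
echo "D's parent:    $(git rev-parse master^)" # same as the old tip, C
```

Commit C itself is unchanged; only the hash stored under the name master is different.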

Multiple branches

Now let's look at making more than one branch, and why we need HEAD.

We'll keep going with our four commits-so-far:

A--B--C--D   <-- master

Now let's make a new branch, develop, using git branch develop or git checkout -b develop. Since branch names are just files (or database entries) holding hash IDs, we will make the new name develop also point to commit D:

A--B--C--D   <-- master, develop

But now that we have two or more branch names, we need to know: which branch are we on? This is where HEAD comes in.

The HEAD in Git is actually just another file, .git/HEAD, that normally contains the string ref: followed by the full name of the branch. If we're on master, .git/HEAD has ref: refs/heads/master in it. If we're on develop, .git/HEAD has ref: refs/heads/develop in it. These refs/heads/ things are the names of the files holding the tip commit hashes, so Git can read HEAD, get the name of the branch, then read the branch file, and get the right hash ID.

Let's draw this in, too, before we've switched to branch develop:

A--B--C--D   <-- master (HEAD), develop

and then after we switch to develop:

A--B--C--D   <-- master, develop (HEAD)

That's all that happens here! There's more stuff that happens elsewhere when switching branches, but for dealing with the graph, all that git checkout does is change the name HEAD is attached to.
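You can check this yourself with something like the following sketch, again in a scratch repository and assuming the default files-based storage, so that .git/HEAD is a plain text file you can print.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com
git commit -q --allow-empty -m "D"
git branch develop                             # new name, same commit

cat .git/HEAD                                  # prints: ref: refs/heads/master
on_master=$(git symbolic-ref HEAD)

git checkout -q develop
cat .git/HEAD                                  # prints: ref: refs/heads/develop
on_develop=$(git symbolic-ref HEAD)

# Both names still resolve to the same commit; only HEAD changed.
git rev-parse master develop
```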

Now let's make a new commit E. The new commit goes in as usual, and its new parent is whatever HEAD says, which is D, so:

A--B--C--D   <-- master, develop (HEAD)
          \
           E

Now we have to update some branch. The current branch is develop, so that's the one we update. We write E's hash ID in, and now we have:

A--B--C--D   <-- master
          \
           E   <-- develop (HEAD)

This is it—this is all there is to making branches grow in Git! We just add on a new commit to wherever HEAD is now, making the new commit's parent be the old HEAD commit. Then we move whichever branch it is to point to the new commit we just made.
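The same step as a runnable sketch (scratch repository and empty commits are assumptions for brevity):

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com
git commit -q --allow-empty -m "D"

git checkout -q -b develop                     # HEAD now names develop
git commit -q --allow-empty -m "E"             # new commit goes where HEAD is

git log -1 --format=%s master                  # prints: D  (master did not move)
git log -1 --format=%s develop                 # prints: E
git log -1 --format=%s develop^                # prints: D  (E's parent is the old HEAD commit)
```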

Merging and merge commits

Now that we have multiple branches, let's make a few more commits on each. We'll have to git checkout each branch and make some commits to get here, but suppose we end up with this graph:

A--B--C--D--G   <-- master (HEAD)
          \
           E--F   <-- develop

We now have one extra commit on master (which is the branch we're on), and two on develop, plus the original four A-B-C-D commits that are on both branches.

(This, by the way, is a peculiar feature of Git, not found in many other version control systems. In most VCSes, the branch a commit is "on" is established when you make the commit, just like commits' parents are set in stone at that time. But in Git, the branch names are very light fluffy things that just point to one single commit: the tip of the branch. So the set of branches that some commit is "on" is determined by finding all branch names, and then following all the backwards-pointing arrows to see which commits are reachable by starting at which branch-tips. This concept of reachable matters a lot, soonish, though we won't get there in this posting. See also http://think-like-a-git.net/ for instance.)
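To see which branches a commit is "on", you can ask Git to do the reachability walk for you. In this sketch the make_commit helper and the lightweight tags named after each commit are hypothetical conveniences of mine, there only so the commits can be referred to by letter.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Hypothetical helper: make an empty commit and tag it with a readable name.
make_commit() { git commit -q --allow-empty -m "$1" && git tag "$1"; }

make_commit D
git checkout -q -b develop && make_commit E && make_commit F
git checkout -q master && make_commit G

git branch --contains D    # lists develop and master
git branch --contains F    # lists only develop
git branch --contains G    # lists only master
```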

Now let's run git merge develop to merge the develop commits back into master. Remember, we're currently on master—just look at HEAD in the drawing. So Git will use the name develop to find its tip commit, which is F, and the name HEAD to find our tip commit, which is G.

Then Git will use this graph we've been drawing to find the common merge base commit. Here, that's commit D. Commit D is where these two branches first join up again.

Git's underlying merge process is somewhat complicated and messy, but if everything goes well—and it usually does—we don't have to look any deeper into it. We can just know that Git compares commit D to commit G to see what we did on master, and compares commit D to commit F to see what they did on develop. Git then combines both sets of changes, making sure that anything done on both branches gets done exactly once.

This process, of computing and combining the change-sets, is the process of merging. More specifically it is a three-way merge (probably called that because there are three inputs: the merge base, and the two branch tips). This is what I like to call the "verb part" of merging: to merge, to do the work of a three-way merge.

The result of this merge process, this merge-as-a-verb, is a source-tree, and you know what we do with a tree, right? We make a commit! So that's what Git does next: it makes a new commit. The new commit works a whole lot like any ordinary commit. It has a tree, which is the one Git just made. It has an author, committer, and commit message. And it has a parent, which is our current or HEAD commit ... and another, second parent, which is the commit we merged-in!

Let's draw in our merge commit H, with its two backwards-pointing parent arrows:

A--B--C--D--G---H   <-- master (HEAD)
          \    /
           E--F   <-- develop

(We didn't—because it's too hard—draw in the fact that the first parent is G and the second is F, but that's a useful property later.)

As with every commit, the new commit goes into the current branch, and makes the branch name advance. So master now points to the new merge commit H. It's H that points back to both G and F.

This kind of commit, this merge commit, also uses the word "merge". In this case "merge" is an adjective, but we (and Git) often just call this "a merge", using the word "merge" as a noun. So a merge, the noun, refers to a merge commit, with merge as adjective. A merge commit is simply any commit with at least two parents.
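Here is a sketch of that same merge in a scratch repository, so you can inspect the merge base and the two parents yourself. The make_commit helper, the one-file-per-commit layout, and the tags are illustrative assumptions, not anything Git requires.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Hypothetical helper: add one small file per commit and tag the commit by name.
make_commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1" && git tag "$1"; }

make_commit D
git checkout -q -b develop && make_commit E && make_commit F
git checkout -q master && make_commit G

# The merge base of the two tips is commit D.
git log -1 --format=%s "$(git merge-base master develop)"    # prints: D

# A true merge: combine D-vs-G with D-vs-F, then commit the result as H.
git merge -q -m "H" develop

git log -1 --format='%s has parents: %p'       # two abbreviated hashes
git log -1 --format=%s HEAD^1                  # prints: G  (first parent)
git log -1 --format=%s HEAD^2                  # prints: F  (second parent)
ls                                             # D.txt, E.txt, F.txt and G.txt are all present
```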

We make a merge commit by running git merge. There is, however, a little bit of a hitch: git merge doesn't always make a merge commit. It can do the verb kind of merge without making the adjective kind, and in fact, it doesn't even always do the verb kind either. We can force Git to make a merge commit using git merge --no-ff, even in the case where it could skip all the work.

For the moment, we'll just use --no-ff, forcing Git to make a real merge. But we'll see first why we will need --no-ff, and then second, why we shouldn't have bothered!

Back to your problem from your question

Let's redraw your graphs my way, because my way is better. :-) You have this to start with:

          B--C--D--E   <-- branch-B
         /            
--o--o--A   <-- master
         \         
          F--G   <-- branch-A

(There's nothing labeled HEAD here because we don't know or care right now which one is HEAD, if it is even any of these.)

You now want to make a new branch, branch-C, pointing to commit A, and make that the current branch. The quickest way to do that, assuming everything is already clean, is to use:

$ git checkout -b branch-C master

which moves to (checks out into the index and work-tree) the commit identified by master (commit A), then makes a new branch branch-C pointing to that commit, then makes HEAD name branch branch-C.

          B--C--D--E   <-- branch-B
         /
--o--o--A   <-- master, branch-C (HEAD)
         \
          F--G   <-- branch-A

Now we'll run the first git merge to pick up branch-A:

$ git merge --no-ff branch-A

This will compare the current commit A to the merge-base commit, which is A again. (This is the reason we need --no-ff: the merge base is the current commit!) Then it will compare the current commit to commit G. Git will combine the changes, which means "take G", and make a new merge commit on our current branch. The name master will continue to point to A, but now I'm going to just stop drawing it altogether due to the limitations of ASCII art:

          B--C--D--E   <-- branch-B
         /
--o--o--A------H   <-- branch-C (HEAD)
         \    /
          F--G   <-- branch-A

Next, we'll merge branch-B:

$ git merge branch-B

This will compare the merge base commit A to commit H, and also compare A to E. (This time the merge base is not the current commit so we don't need --no-ff.) Git will, as usual, try to combine the changes—merge as a verb—and if it succeeds, Git will make another merge commit (merge as a noun or adjective), which we can draw like this:

          B--C--D--E   <-- branch-B
         /          \
--o--o--A------H-----I   <-- branch-C (HEAD)
         \    /
          F--G   <-- branch-A

Note that none of the other names have moved at all. Branches branch-A and branch-B still point to their original commits, so you can still git checkout either of them. Branch master still points to A (and if this were a whiteboard or paper or some such we could keep it drawn in). The name branch-C now points to the second of the two merge commits we made, since each of our merges can only point back to two commits, not to three at once.
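Putting the whole sequence together, here is a sketch that rebuilds this graph in a scratch repository and runs the two merges. The make_commit helper and the one-file-per-commit contents are assumptions of mine; the branch names and the merge commands are the ones used above.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Hypothetical helper: add one small file per commit.
make_commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

make_commit A                                  # tip of master
git checkout -q -b branch-B master && make_commit B && make_commit C && make_commit D && make_commit E
git checkout -q -b branch-A master && make_commit F && make_commit G
tipA=$(git rev-parse branch-A); tipB=$(git rev-parse branch-B)

git checkout -q -b branch-C master             # new branch at A, now current
git merge -q --no-ff -m "H" branch-A           # forced merge commit H
git merge -q -m "I" branch-B                   # ordinary merge commit I

git log --graph --oneline --all                # draw the resulting graph
git rev-list --merges --count master..branch-C # prints: 2

# branch-A and branch-B are untouched and can still be checked out.
git checkout -q branch-A && git checkout -q branch-C
```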

Git does have a three-at-once kind of merge

If, for some reason, you don't like having two merges, Git does offer something called an octopus merge, that can merge more than two branch tips at once. But there's never any requirement to do an octopus merge, so I'm just mentioning it here for completeness.

What we really should be observing instead is that one of these two merges was unnecessary.

We didn't need one of the merges

We started out with git merge --no-ff branch-A, and we had to use --no-ff to prevent Git from doing what Git calls a fast forward merge. We also noted why: it's because the merge base, commit A in our drawing, was the same commit to which branch-C pointed at the time.

The way we made Git combine the "changes" going from commit A to commit A (all zero of these "changes") with the changes it found going from commit A to commit G was to use --no-ff: OK, Git, I know you can do this as a fast-forward non-merge, but I want a real merge in the end, so pretend you worked hard and make a merge commit. If we left out this option, Git would simply "slide the branch label forward", going against the direction of the internal commit arrows. We would start with:

          B--C--D--E   <-- branch-B
         /
--o--o--A   <-- master, branch-C (HEAD)
         \
          F--G   <-- branch-A

and then Git would do this:

          B--C--D--E   <-- branch-B
         /
--o--o--A   <-- master
         \
          F--G   <-- branch-A, branch-C (HEAD)

Then, when we did the second merge—for which we did not and still do not need --no-ff—Git would find the merge base A, compare A vs G, compare A vs E, combine the changes to make a new tree object, and make a new commit H out of the result:

          B--C--D-----E   <-- branch-B
         /             \
--o--o--A   <-- master  H   <-- branch-C (HEAD)
         \             /
          F-----------G   <-- branch-A

Just as before, none of the other labels move at all (and this time we can draw the name master in by stretching out the graph a bit). We get only the one merge commit H, instead of two merge commits H--I.
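The fast-forward variant can be sketched the same way. As before, the scratch repository and the make_commit helper are illustrative assumptions, and this relies on Git's default behaviour of fast-forwarding when it can (no merge.ff setting that forbids it).

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Hypothetical helper: add one small file per commit.
make_commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

make_commit A
git checkout -q -b branch-B master && make_commit B && make_commit C && make_commit D && make_commit E
git checkout -q -b branch-A master && make_commit F && make_commit G

git checkout -q -b branch-C master
git merge -q branch-A                          # fast-forward: the label slides to G, no new commit
git rev-parse branch-A branch-C                # prints the same hash twice

git merge -q -m "H" branch-B                   # the one real merge
git rev-list --merges --count master..branch-C # prints: 1
git log -1 --format=%s master                  # prints: A  (master never moved)
```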

Why you might want --no-ff anyway

If we make two merges, using git merge --no-ff, the source tree we'll get, when Git combines all our changes, will be the same as the source tree we get if we allow the one fast-forward merge. But the final graph is different.
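One way to convince yourself of this is to build both variants side by side and compare their tree hashes. In the sketch below, the branch names two-merges and one-merge and the make_commit helper are hypothetical names chosen for the illustration, and the plain git merge is assumed to fast-forward as it does by default.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name "A U Thor" && git config user.email thor@example.com

# Hypothetical helper: add one small file per commit.
make_commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

make_commit A
git checkout -q -b branch-B master && make_commit B && make_commit C && make_commit D && make_commit E
git checkout -q -b branch-A master && make_commit F && make_commit G

# Variant 1: two real merges.
git checkout -q -b two-merges master
git merge -q --no-ff -m "H" branch-A
git merge -q -m "I" branch-B

# Variant 2: one fast-forward, then one real merge.
git checkout -q -b one-merge master
git merge -q branch-A
git merge -q -m "H" branch-B

git rev-parse 'two-merges^{tree}' 'one-merge^{tree}'   # same tree hash twice
git rev-list --merges --count master..two-merges       # prints: 2
git rev-list --merges --count master..one-merge        # prints: 1
```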

The commit graph, in Git, is the history. If you want to know what happened in the past, what you have—the thing you can look at—is the commit graph. The graph is made up of all the commits, and the commits store the author and committer names and dates and log messages. They link to the saved source trees, and provide the parent links that make up the graph.

This means that if, in the future, you want to know that you made two merges, you must make two merges now. But if in the future you don't care how many git merge commands you ran, you can let any number of those git merge steps be fast-forward (non-merge) operations. They leave no trace in the commit graph (they just move one branch name label from one commit to another), so in the future you can't really tell if this ever happened. The graph does not store name motion; it has only the commits.