What does “changes introduced by a commit” mean in Git?

Posted 2020-02-14 11:01

Question:

Everywhere I see this: "...cherry-pick applies changes introduced by a commit..."

I did this: created this file in master:

** File 1 **

Content

** Footer **

then branched out to branch2 and committed a change:

** File 1 **

Content
Edit 1

** Footer **

and then another one:

** File 1 **

Content
Edit 2
Edit 1

** Footer **

Now I went back to master and tried to cherry-pick the latest commit from branch2. I expected that only 'Edit 2' would get imported, since isn't that the change introduced by that commit, compared to the previous one?

What I got instead is the following merge conflict:

** File 1 **

Content
<<<<<<< HEAD
=======
Edit 2
Edit 1
>>>>>>> b634e53...
** Footer **

Now my obvious question is what is it that I misunderstand about how cherry-pick works, and concretely why there is a merge conflict here, which would be a fast-forward with git merge?

IMPORTANT NOTICE: This is NOT meant as a question about merge conflicts, what I'm interested in is what cherry-pick is actually doing here. And I'm not asking from curiosity/whatever, but because I'm running into troubles using git on my job.

Answer 1:

As several people have noted in comments (and made links to other questions), git cherry-pick actually does a three way merge. How do cherry-pick and revert work? describes this, but more in terms of content than mechanism.

I describe the source of a particular set of merge conflicts in Why do I get this merge conflict with git rebase interactive?, along with a general outline of cherry-pick and revert, but I think it's a good idea to step back and ask the mechanism question you did. I would re-frame it a bit, though, as these three questions:

  • Is a commit really a snapshot?
  • If a commit is a snapshot, how does git show or git log -p show it as a change?
  • If a commit is a snapshot, how can git cherry-pick or git revert work?

Answering the last requires first answering one more question:

  • How does Git perform git merge?

So, let's take these four questions, in the correct order. This is going to be rather long, and if you like, you can jump straight to the last section—but note that it builds on the third section, which builds on the second, which builds on the first.

Is a commit really a snapshot?

Yes—though, technically, a commit refers to a snapshot, rather than being one. This is pretty simple and straightforward. To use Git, we generally start out by running git clone, which gets us a new repository. Occasionally, we start out by making an empty directory and using git init to create an empty repository. Either way, though, we now have three entities:

  1. The repository itself, which is a big database of objects, plus a smaller database of name to hash ID mappings (for, e.g., branch names), plus lots of other mini-databases implemented as single files (e.g., one per reflog).

  2. Something Git calls the index, or the staging area, or sometimes the cache. What it gets called depends on who does the calling. The index is essentially where you have Git build the next commit you will make, though it takes on an expanded role during merges.

  3. The work-tree, which is where you can actually see files and work on / with them.

The object database holds four types of objects, which Git calls commits, trees, blobs, and annotated tags. Trees and blobs are mostly implementation detail, and we can ignore annotated tags here: the main function of this big database, for our purposes, is to hold all our commits. These commits then refer to the trees and blobs that hold the files. In the end, it's actually the combination of trees-plus-blobs that is the snapshot. Still, every commit has exactly one tree, and that tree is what gets us the rest of the way to the snapshot, so except for lots of devilish implementation details, the commit itself might as well be a snapshot.
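To make the commit-refers-to-tree-refers-to-blobs chain concrete, here is a toy content-addressed store. It is only a sketch: real Git hashes a typed header plus the content and uses its own binary tree format, and the names store, put, write_snapshot, and snapshot_of are made up for this illustration.

```python
import hashlib
import json

store = {}  # hash ID -> (kind, payload); stands in for the object database


def put(kind, payload):
    """Store an object under the hash of its content and return that hash ID."""
    data = json.dumps([kind, payload], sort_keys=True).encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = (kind, payload)
    return oid


def write_snapshot(files, parent=None, message=""):
    """Write one blob per file, a tree naming the blobs, and a commit naming the tree."""
    tree = put("tree", {path: put("blob", text) for path, text in files.items()})
    parents = [parent] if parent else []
    return put("commit", {"tree": tree, "parents": parents, "message": message})


def snapshot_of(commit_id):
    """Follow commit -> tree -> blobs to recover the full set of files."""
    _, commit = store[commit_id]
    _, tree = store[commit["tree"]]
    return {path: store[blob_id][1] for path, blob_id in tree.items()}


first = write_snapshot({"file1.txt": "Content\n"}, message="initial")
second = write_snapshot({"file1.txt": "Content\nEdit 1\n"}, parent=first, message="edit 1")
print(snapshot_of(second))
```

The commit object itself holds only a tree ID, its parent IDs, and metadata; the files come back only by following the references, which is the sense in which a commit "refers to" a snapshot.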

How we use the index to make new snapshots

We won't go too deep into the weeds yet, but we will say that the index works by holding a compressed, Git-ified, mostly-frozen copy of every file. Technically, it holds a reference to the actually-frozen copy, stored as a blob. That is, if you start by doing git clone url, Git has run git checkout branch as the last step of the clone. This checkout filled in the index from the commit at the tip of branch, so that the index has a copy of every file in that commit.

Indeed, most[1] git checkout operations fill in both the index and the work-tree from a commit. This lets you see, and use, all of your files in the work-tree, but the work-tree copies aren't the ones that are actually in the commit. What's in the commit is (are?) frozen, compressed, Git-ified, can-never-be-changed blob snapshots of all of those files. This keeps those versions of those files forever (or for as long as the commit itself exists) and is great for archival, but useless for doing any actual work. That's why Git de-Git-ifies the files into the work-tree.

Git could stop here, with just commits and work-trees. Mercurial—which is in many ways like Git—does stop here: your work-tree is your proposed next commit. You just change stuff in your work-tree and then run hg commit and it makes the new commit from your work-tree. This has the obvious advantage that there's no pesky index making trouble. But it also has some drawbacks, including being inherently slower than Git's method. In any case, what Git does is to start with the previous commit's information saved in the index, ready to be committed again.

Then, each time you run git add, Git compresses and Git-ifies the file you add, and updates the index now. If you change just a few files, and then git add just those few files, Git only has to update a few index entries. So this means that at all times the index has the next snapshot inside it, in the special Git-only compressed and ready-to-freeze form.

This in turn means that git commit simply needs to freeze the index contents. Technically, it turns the index into a new tree, ready for the new commit. In a few cases, such as after some reverts, or for a git commit --allow-empty, the new tree will actually be the same tree as some previous commit, but you don't need to know or care about this.

At this point, Git collects your log message and the other metadata that goes into each commit. It adds the current time as the time-stamp—this helps make sure that each commit is totally unique, as well as being generally useful. It uses the current commit as the new commit's parent hash ID, uses the tree hash ID produced by saving the index, and writes out the new commit object, which gets a new and unique commit hash ID. The new commit therefore contains the actual hash ID of whatever commit you had checked out earlier.

Last, Git writes the new commit's hash ID into the current branch name, so that the branch name now refers to the new commit, rather than to the new commit's parent, as it used to. That is, whatever commit was the tip of the branch, that commit is now one step behind the tip of the branch. The new tip is the commit you just made.
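The add / commit / move-the-branch sequence described above can be modeled in a few lines. This is a rough sketch under simplifying assumptions: the ToyRepo class and its fields are hypothetical, the "tree" is just a copy of the index dictionary, and the hash is computed over a Python repr rather than Git's real object encoding.

```python
import hashlib
import time


class ToyRepo:
    def __init__(self):
        self.commits = {}                   # commit ID -> commit record
        self.branches = {"master": None}    # branch name -> tip commit ID
        self.head = "master"                # name of the current branch
        self.index = {}                     # path -> content: the proposed next snapshot

    def add(self, path, content):
        """Like git add: update just this one entry in the index."""
        self.index[path] = content

    def commit(self, message):
        """Freeze the index, record the current tip as parent, then move the branch."""
        record = {
            "tree": dict(self.index),               # frozen copy of the index
            "parent": self.branches[self.head],     # whatever was checked out
            "message": message,
            "time": time.time(),
        }
        commit_id = hashlib.sha1(repr(sorted(record.items())).encode()).hexdigest()
        self.commits[commit_id] = record
        self.branches[self.head] = commit_id        # branch name now names the new commit
        return commit_id


repo = ToyRepo()
repo.add("file1.txt", "Content\n")
first = repo.commit("initial")
repo.add("file1.txt", "Content\nEdit 1\n")
second = repo.commit("edit 1")
print(repo.commits[second]["parent"] == first)
```

Note that the second commit only needed one git-add-style update: everything else in the index was already there from the previous commit.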


[1] You can use git checkout commit -- path to extract one particular file from one particular commit. This still copies the file into the index first, so that's not really an exception. However, you can also use git checkout to copy files just from the index, to the work-tree, and you can use git checkout -p to selectively, interactively patch files, for instance. Each of these variants has its own special set of rules as to what it does with index and/or work-tree.

Since Git builds new commits from the index, it may be wise—albeit painful—to re-check the documentation often. Fortunately, git status tells you a lot about what's in the index now—by comparing the current commit vs the index, then comparing the index vs the work-tree, and for each such comparison, telling you what's different. So a lot of the time, you don't have to carry around, in your head, all the wildly varying details of each Git command's effect on index and/or work-tree: you can just run the command, and use git status later.
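The two comparisons that git status makes can be sketched with three plain dictionaries standing in for the current commit, the index, and the work-tree. This is a simplified model with hypothetical names (changed_paths, status); real git status also reports untracked files, renames, and more.

```python
def changed_paths(old, new):
    """Paths whose content differs between two path -> content mappings."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def status(head, index, worktree):
    """Compare commit vs index, then index vs work-tree, as described above."""
    return {
        "staged": changed_paths(head, index),          # would go into the next commit
        "not staged": changed_paths(index, worktree),  # edited but not yet git add-ed
    }


head = {"a.txt": "1", "b.txt": "1"}
index = {"a.txt": "2", "b.txt": "1"}      # a.txt was edited and added
worktree = {"a.txt": "2", "b.txt": "3"}   # b.txt was edited but not added
print(status(head, index, worktree))
```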


How does git show or git log -p show a commit as a change?

Each commit contains the raw hash ID of its parent commit, which in turn means that we can always start at the last commit of some string of commits, and work backwards to find all the previous commits:

... <-F <-G <-H   <--master

We only need to have a way to find the last commit. That way is: the branch name, such as master here, identifies the last commit. If that last commit's hash ID is H, Git finds commit H in the object database. H stores G's hash ID, from which Git finds G, which stores F's hash ID, from which Git finds F, and so on.

This is also the guiding principle behind showing a commit as a patch. We have Git look at the commit itself, find its parent, and extract that commit's snapshot. Then we have Git extract the commit's snapshot too. Now we have two snapshots, and now we can compare them—subtract the earlier one from the later one, as it were. Whatever is different, that must be what changed in that snapshot.
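Here is a small sketch of that subtraction, using Python's difflib on the two branch2 snapshots from the question. The file name file1.txt and the function name show_as_patch are assumptions for illustration; Git uses its own diff engine, so the exact hunk layout may differ.

```python
import difflib

# Snapshot in the parent commit (first branch2 commit) and in the commit itself.
parent_snapshot = {"file1.txt": "** File 1 **\n\nContent\nEdit 1\n\n** Footer **\n"}
commit_snapshot = {"file1.txt": "** File 1 **\n\nContent\nEdit 2\nEdit 1\n\n** Footer **\n"}


def show_as_patch(parent, commit):
    """Compare two full snapshots file by file and render the differences."""
    lines = []
    for path in sorted(parent.keys() | commit.keys()):
        old = parent.get(path, "").splitlines(keepends=True)
        new = commit.get(path, "").splitlines(keepends=True)
        lines += difflib.unified_diff(old, new, "a/" + path, "b/" + path)
    return "".join(lines)


patch = show_as_patch(parent_snapshot, commit_snapshot)
print(patch)
```

Both snapshots hold the complete file; the single added line, Edit 2, only appears once the two are compared.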

Note that this only works for non-merge commits. When we have Git build a merge commit, we have Git store not one but two parent hash IDs. For instance, after running git merge feature while on master, we may have:

       G--H--I
      /       \
...--F         M   <-- master (HEAD)
      \       /
       J--K--L   <-- feature

Commit M has two parents: its first parent is I, which was the tip commit on master just a moment ago. Its second parent is L, which is still the tip commit on feature. It's hard—well, impossible, really—to present commit M as a simple change from either I or L, and by default, git log simply doesn't bother to show any changes here!

(You can tell both git log and git show to, in effect, split the merge: to show a diff from I to M, and then to show a second, separate diff from L to M, using git log -m -p or git show -m. The git show command produces, by default, what Git calls a combined diff, which is kind of weird and special: it's made by, in effect, running both diffs as for -m, then ignoring most of what they say and showing you only some of those changes that come from both commits. This relates pretty strongly to how merges work: the idea is to show the parts that might have had merge conflicts.)

This leads us to our embedded question, which we need to cover before we get to cherry-pick and revert. We need to talk about the mechanics of git merge, i.e., how we got a snapshot for commit M in the first place.

How does Git perform git merge?

Let's start by noting that the point of a merge—well, of most merges, anyway—is to combine work. When we did git checkout master and then git merge feature, we meant: I did some work on master. Someone else did some work on feature. I'd like to combine the work they did with the work I did. There is a process for doing this combining, and then a simpler process for saving the result.

Thus, there are two parts to a true merge that results in a commit like M above. The first part is what I like to call the verb part, to merge. This part actually combines our different changes. The second part is making a merge, or a merge commit: here we use the word "merge" as a noun or an adjective.

It's also worth mentioning here that git merge doesn't always make a merge. The command itself is complicated and has lots of fun flag arguments to control it in various ways. Here, we're only going to consider the case where it really does make an actual merge, because we're looking at merge in order to understand cherry-pick and revert.

Merge as a noun or adjective

The second part of a real merge is the easier part. Once we've finished the to merge process, the merge-as-a-verb, we have Git make a new commit in the usual way, using whatever is in the index. This means the index needs to end up with the merged content in it. Git will build the tree as usual and collect a log message as usual—we can use the not-so-good default, merge branch B, or construct a good one if we're feeling particularly diligent. Git will add our name, email address, and timestamp as usual. Then Git will write out a commit—but instead of storing, in this new commit, just the one parent, Git will store an extra, second parent, which is the hash ID of the commit we chose when we ran git merge.

For our git merge feature while on master, for instance, the first parent will be commit I—the commit we had checked out by running git checkout master. The second parent will be commit L, the one to which feature points. That's really all there is to a merge: a merge commit is just a commit with at least two parents, and the standard two parents for a standard merge are that the first is the same as for any commit, and the second is the one we picked by running git merge something.

Merge as a verb

The merge-as-a-verb is the harder part. We noted above that Git is going to make the new commit from whatever is in the index. So, we need to put into the index, or have Git put into it, the result of combining work.

We declared above that we made some changes on master, and they—whoever they are—made some changes on feature. But we already saw that Git doesn't store changes. Git stores snapshots. How do we go from snapshot to change?

We already know the answer to that question! We saw it when we looked at git show. Git compares two snapshots. So for git merge, we just need to pick the right snapshots. But which ones are the right snapshots?

The answer to this question lies in the commit graph. Before we run git merge, the graph looks like this:

       G--H--I   <-- master (HEAD)
      /
...--F
      \
       J--K--L   <-- feature

We're sitting on commit I, the tip of master. Their commit is commit L, the tip of feature. From I, we can work backwards to H and then G and then F and then presumably E and so on. Meanwhile, from L, we can work backwards to K and then J and then F and presumably E and so on.

When we do actually do this work-backwards trick, we converge at commit F. Obviously, then, whatever changes we made, we started with the snapshot in F ... and whatever changes they made, they also started with the snapshot in F! So all we have to do, to combine our two sets of changes, is:

  • compare F to I: that's what we changed
  • compare F to L: that's what they changed

We will, in essence, just have Git run two git diffs. One will figure out what we changed, and one will figure out what they changed. Commit F is our common starting point, or in version-control-speak, the merge base.
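The work-backwards trick can be sketched on the graph drawn above. This is a naive illustration, not how Git actually computes merge bases; the parents table is typed in by hand, and commit E is treated as a root purely to keep the example finite.

```python
# Child -> list of parents, matching the drawing above (E as root is an assumption).
parents = {
    "E": [], "F": ["E"],
    "G": ["F"], "H": ["G"], "I": ["H"],
    "J": ["F"], "K": ["J"], "L": ["K"],
}


def ancestors(commit):
    """The commit itself plus everything reachable by following parent links."""
    seen, todo = set(), [commit]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(parents[c])
    return seen


def merge_bases(ours, theirs):
    """Common ancestors that are not proper ancestors of another common ancestor."""
    common = ancestors(ours) & ancestors(theirs)
    return {c for c in common if not any(c in ancestors(d) - {d} for d in common)}


print(merge_bases("I", "L"))  # {'F'}
```

E is also a common ancestor of I and L, but it is filtered out because F is closer to both tips.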

Now, to actually accomplish the merge, Git expands the index. Instead of holding one copy of each file, Git will now have the index hold three copies of each file. One copy will come from the merge base F. A second copy will come from our commit I. The last, third, copy comes from their commit L.

Meanwhile, Git also looks at the result of the two diffs, file-by-file. As long as commits F, I, and L all have all the same files,[2] there are only these five possibilities:

  1. Nobody touched the file. Just use any version: they're all the same.
  2. We changed the file and they didn't. Just use our version.
  3. They changed the file and we didn't. Just use their version.
  4. We and they both changed the file, but we made the same changes. Use either ours or theirs—both are the same, so it doesn't matter which.
  5. We and they both changed the same file, but we made different changes.

Case 5 is the only tough one. For all the others, Git knows—or at least assumes it knows—what the right result is, so for all those other cases, Git shrinks the index slots for the file in question back to just one slot (numbered zero) that holds the correct result.
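The five cases can be written as one small decision function over the three copies of a file. The function name resolve_file is made up, and comparing whole file contents as strings is a simplification of what Git does with blob hash IDs.

```python
def resolve_file(base, ours, theirs):
    """Return (case number, result); result is None when a low-level merge is needed."""
    if ours == base and theirs == base:
        return 1, base      # nobody touched the file
    if theirs == base:
        return 2, ours      # only we changed it
    if ours == base:
        return 3, theirs    # only they changed it
    if ours == theirs:
        return 4, ours      # both made the same change
    return 5, None          # both changed it differently: needs the low-level merge


print(resolve_file("v1", "v1", "v2"))  # (3, 'v2')
print(resolve_file("v1", "v2", "v3"))  # (5, None)
```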

For case 5, though, Git stuffs all three copies of the three input files into three numbered slots in the index. If the file is named file.txt, :1:file.txt holds the merge base copy from F, :2:file.txt holds our copy from commit I, and :3:file.txt holds their copy from L. Then Git runs a low-level merge driver—we can set one in .gitattributes, or use the default one.

The default low-level merge takes the two diffs, from base to ours and from base to theirs, and tries to combine them by taking both sets of changes. Wherever only one side touched a region of the file, Git takes that side's change, whether ours or theirs. When both sides touch the same lines, or lines that abut, Git declares a merge conflict.[3] Git writes the resulting file to the work-tree as file.txt, with conflict markers if there were conflicts. If you set merge.conflictStyle to diff3, the conflict markers include the base file from slot 1, as well as the lines from the files in slots 2 and 3. I like this conflict style much better than the default, which omits the slot-1 context and shows just the slot-2 vs slot-3 conflict.
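A line-level three-way merge in this spirit can be sketched with difflib. To be clear, this is a textbook diff3-style approach and not Git's actual merge code: the helper names sync_regions and merge3 are invented, and Git's own diff engine may line things up differently on harder inputs.

```python
from difflib import SequenceMatcher


def sync_regions(base, ours, theirs):
    """Spans of base lines unchanged on BOTH sides: (base_lo, base_hi, ours_lo, theirs_lo)."""
    a = SequenceMatcher(None, base, ours, autojunk=False).get_matching_blocks()
    b = SequenceMatcher(None, base, theirs, autojunk=False).get_matching_blocks()
    regions, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        abase, amatch, alen = a[i]
        bbase, bmatch, blen = b[j]
        lo, hi = max(abase, bbase), min(abase + alen, bbase + blen)
        if lo < hi:
            regions.append((lo, hi, amatch + lo - abase, bmatch + lo - bbase))
        if abase + alen < bbase + blen:
            i += 1
        else:
            j += 1
    regions.append((len(base), len(base), len(ours), len(theirs)))  # end sentinel
    return regions


def merge3(base, ours, theirs):
    """Combine both sides' changes; returns (merged lines, conflicted flag)."""
    out, conflicted = [], False
    ib = io = it = 0
    for lo, hi, o_lo, t_lo in sync_regions(base, ours, theirs):
        b_chunk, o_chunk, t_chunk = base[ib:lo], ours[io:o_lo], theirs[it:t_lo]
        if o_chunk == t_chunk or t_chunk == b_chunk:
            out += o_chunk                     # same change, or only we changed it
        elif o_chunk == b_chunk:
            out += t_chunk                     # only they changed it
        else:
            conflicted = True
            out += ["<<<<<<< ours", *o_chunk, "=======", *t_chunk, ">>>>>>> theirs"]
        out += base[lo:hi]                     # lines nobody touched
        ib, io, it = hi, o_lo + (hi - lo), t_lo + (hi - lo)
    return out, conflicted


base = ["a", "b", "c", "d", "e"]
clean = merge3(base, ["a", "B", "c", "d", "e"], ["a", "b", "c", "D", "e"])
clash = merge3(base, ["a", "B", "c", "d", "e"], ["a", "X", "c", "d", "e"])
print(clean)
print(clash)
```

In the first call the two sides edit different lines and both edits survive; in the second they edit the same line, so the sketch emits default-style markers with no base section.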

Of course, if there are conflicts, Git declares the merge conflicted. In this case, it (eventually, after processing all the other files) stops in the middle of the merge, leaving the conflict-marker mess in the work-tree and all three copies of file.txt in the index, in slots 1, 2, and 3. But if Git is able to resolve the two different change-sets on its own, it goes ahead and erases slots 1-3, writes the successfully-merged file to the work-tree,[4] copies the work-tree file into the index at the normal slot zero, and proceeds with the rest of the files as usual.

If the merge does stop, it is your job to fix the mess. Many people do this by editing the conflicted work-tree file, figuring out what the right result is, writing out the work-tree file, and running git add to copy that file into the index.[5] The copy-into-index step removes the stage 1-3 entries and writes the normal stage-zero entry, so that the conflict is resolved and we're ready to commit. Then you tell the merge to continue, or run git commit directly since git merge --continue just runs git commit anyway.

This to merge process, while a bit complicated, is in the end pretty straightforward:

  • Pick a merge base.
  • Diff the merge base against the current commit, the one we have checked out that we're going to modify by merging, to see what we changed.
  • Diff the merge base against the other commit, the one we picked to merge, to see what they changed.
  • Combine the changes, applying the combined changes to the snapshot in the merge base. That's the result, which goes in the index. It's OK that we start out with the merge base version, because the combined changes include our changes: we won't lose them unless we say take only their version of the file.

This to merge or merge as a verb process is then followed by the merge as noun step, making a merge commit, and the merge is done.


[2] If the three input commits don't have all the same files, things get tricky. We can have add/add conflicts, modify/rename conflicts, modify/delete conflicts, and so on, all of which are what I call high level conflicts. These also stop the merge in the middle, leaving slots 1-3 of the index populated as appropriate. The -X flags, -X ours and -X theirs, do not affect high level conflicts.

[3] You can use -X ours or -X theirs to make Git choose "our change" or "their change" instead of stopping with a conflict. Note that you specify this as an argument to git merge, so it applies to all files that have conflicts. It's possible to do this one file at a time, after the conflict happens, in a more intelligent and selective way, using git merge-file, but Git does not make this as easy as it should.

[4] At least, Git thinks the file is successfully merged. Git is basing this on nothing more than the reasoning that "the two sides of the merge touched different lines of the same file, so that must be OK", when that's not necessarily actually OK at all. It works pretty well in practice, though.

[5] Some people prefer merge tools, which generally show you all three of the input files and allow you to construct the correct merge result somehow, with the how depending on the tool. A merge tool can simply extract those three inputs from the index, since they are right there in the three slots.

How do git cherry-pick and git revert work?

These are also three-way merge operations. They use the commit graph, in a fashion similar to the way git show uses it. They are not as fancy as git merge, even though they use the merge as a verb part of the merge code.

Instead, we start with whatever commit graph you might have, e.g.:

...---o--P--C---o--...
      .      .
       .    .
        .  .
 ...--o---o---H   <-- branch (HEAD)

The actual relationship, if any, between H and P, and between H and C, is not important. The only thing that matters here is that the current (HEAD) commit is H, and that there is some commit C (the child) with a (one, single) parent commit P. That is, C is the commit we want to pick or revert, and P is its direct parent.

Since we're on commit H, that's what is in our index and work-tree. Our HEAD is attached to the branch named branch, and branch points to commit H.[6] Now, what Git does for git cherry-pick hash-of-C is simple:

  • Choose commit P as the merge base.
  • Do a standard three-way merge, the merge as a verb part, using the current commit H as ours and commit C as theirs.

This merge-as-a-verb process happens in the index, just as for git merge. When it's all done successfully (or you've cleaned up the mess, if it wasn't successful, and you've run git cherry-pick --continue), Git goes on to make an ordinary, non-merge commit.

If you look back at the merge-as-a-verb process, you'll see that this means:

  • diff commit P vs C: that's what they changed
  • diff commit P vs H: that's what we changed
  • combine these differences, applying them to what's in P

So git cherry-pick is a three-way merge. It's just that what they changed is the same thing that git show would show! Meanwhile, what we changed is everything we need to turn P into H—and we do need that, because we want to keep H as our starting point, and only add their changes to that.

But this is also how and why cherry-pick sometimes sees some strange—we think—conflicts. It has to combine the entire set of P-vs-H changes with the P-vs-C changes. If P and H are very far apart, those changes could be massive.
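Applied to the file from the question, the two diffs look like this. The sketch assumes my reading of the setup: P is the first branch2 commit (with Edit 1), C is the picked commit (with Edit 2 added), and H is master, which has neither edit. The helper name changes is made up, and difflib stands in for Git's diff.

```python
import difflib

HEADER, FOOTER = ["** File 1 **", "", "Content"], ["", "** Footer **"]
P = HEADER + ["Edit 1"] + FOOTER             # parent of the picked commit: the merge base
C = HEADER + ["Edit 2", "Edit 1"] + FOOTER   # the picked commit: "theirs"
H = HEADER + FOOTER                          # master, the current commit: "ours"


def changes(old, new):
    """Only the added and removed lines of a diff, without headers or context."""
    diff = difflib.unified_diff(old, new, lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]


print("what they changed (P vs C):", changes(P, C))
print("what we changed   (P vs H):", changes(P, H))
```

From the merge's point of view, "we" deleted Edit 1 and "they" inserted Edit 2 directly above it. Both changes land on the same spot in P, which is consistent with the conflict shown in the question: master never had Edit 1, so relative to P it looks like a deletion.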

The git revert command is just as simple as git cherry-pick, and in fact, is implemented by the same source files in Git. All it does is use commit C as the merge base and commit P as their commit (while using H as ours as usual). That is, Git will diff C, the commit to revert, vs H, to see what we did. Then it will diff C, the commit to revert, vs P to see what they did—which is, of course, the reverse of what they actually did. Then the merge engine, the part that implements merge as a verb, will combine these two sets of changes, applying the combined changes to C and putting the result into the index and our work-tree. The combined result keeps our changes (C vs H) and undoes their changes (C vs P being a reverse-diff).

If all goes well, we end up with a perfectly ordinary new commit:

...---o--P--C---o--...
      .      .
       .    .
        .  .
 ...--o---o---H--I   <-- branch (HEAD)

The difference from H to I, which is what we will see with git show, is either a copy of the P-to-C changes (cherry-pick) or a reversal of the P-to-C changes (revert).
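The only difference between the two commands, as described above, is which commits play the base and theirs roles. Here is a sketch using a whole-file three-way merge; the function names are hypothetical, file contents are compared as whole strings, and added or deleted files and line-level merging are ignored.

```python
def three_way(base, ours, theirs):
    """Whole-file three-way merge; raises if both sides changed the file differently."""
    if ours == theirs or theirs == base:
        return ours
    if ours == base:
        return theirs
    raise ValueError("conflict")


def merge_all(base, ours, theirs):
    """Apply three_way to every path in the three snapshots."""
    paths = base.keys() | ours.keys() | theirs.keys()
    return {p: three_way(base.get(p), ours.get(p), theirs.get(p)) for p in paths}


def cherry_pick(H, P, C):
    return merge_all(base=P, ours=H, theirs=C)   # their change is P -> C


def revert(H, P, C):
    return merge_all(base=C, ours=H, theirs=P)   # their change is C -> P, i.e. backwards


P = {"a.txt": "1", "b.txt": "x"}
C = {"a.txt": "2", "b.txt": "x"}   # C changed a.txt
H = {"a.txt": "1", "b.txt": "y"}   # we changed b.txt

picked = cherry_pick(H, P, C)
print(picked)                 # our b.txt plus their a.txt change
print(revert(picked, P, C))   # their a.txt change undone again
```

Reverting C on top of the cherry-pick result gives back exactly the snapshot in H, while our own change to b.txt is kept in both directions.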


[6] Both cherry-pick and revert refuse to run unless the index and work-tree match the current commit, though they do have modes that allow them to be different. The "allowed to be different" is just a matter of tweaking expectations, and of the fact that if the pick or revert fails, it may be impossible to recover cleanly. If the work-tree and index match the commit, it's easy to recover from a failed operation, so that's why this requirement exists.