Difference between creating a branch and doing a s

Posted 2019-06-02 19:22

Question:

Say the history of my commits is A - B - C and I have only this branch.

B was fully working. I started adding some functionality in C, but it's not working so I need to go back to B, but I also want to retain the code I wrote in C because I will want to review it and fix it later. What is the best way to do it?

Is the best way to create a new branch starting from B?

What is the difference between that and doing a soft reset? I understand a soft reset doesn't delete the changes (is that correct?) but it's not clear to me how to restore those changes (the code in C), nor what the difference between a soft reset and creating a branch is.

Aside

Git just seems needlessly arcane and obscure. I mean, the official docs define push as:

https://git-scm.com/docs/git-push

git-push - Update remote refs along with associated objects

I am sure it is technically correct, but it is hardly the most user-friendly explanation. Could they have added a comment explaining it uploads the local repository to the remote one, or something like that?

Answer 1:

All the answers here are OK. What's missing is, well ... this is where your rant comes in. :-) Your professor-quote here is quite apposite:

One of my best professors at uni always said: beware of those who try to dumb down very complex concepts, but also beware of complexity for its own sake: those who cannot explain a simple concept in a simple way either want to show off or do not really understand the concept themselves!

Or, as Einstein supposedly put it, "Make everything as simple as possible, but no simpler."

Unfortunately, what Git does—distributed source code control—is inherently complex. Fortunately, there are some simple ways to get started. Unfortunately, traditional books, and Git's documentation itself, do this not-so-well, in my opinion. The Pro Git book is, I think, pretty good (and has the advantage of generally being up-to-date), and there are some other books that are unfortunately terribly out of date now that were pretty good, but most introductions try to start without a proper foundation.

The foundation requires some terminology as well. This is probably where the Git manual pages fail the hardest. They just spray terminology—sometimes inconsistent terminology, although this has improved over time—all over the place. This has led to some pretty funny web pages. (I think a lot of introductions to Git shy away from terminology because the core foundation of Git lies in graph theory and hashing theory, and people find the mathematical aspect of those scary.)

Git itself makes things harder than necessary. A simple existence proof of that is Mercurial. Mercurial and Git are, at least in terms of what they do with source code, equally powerful—but those new to distributed source control have far fewer problems getting started in Mercurial than they do in Git. It's not 100% clear why that is, but I think there are two key things Mercurial does differently that produce this result:

  • In Mercurial, branches are global and permanent. This is very convenient for beginning work but, at least sometimes, proves to be a trap. Mercurial eventually added bookmarks that work like Git's branches.

  • Mercurial does not have the thing that Git calls the index.

These aren't the only things—Git has a lot of other, smaller annoyances as well that just aren't there in Mercurial—but I think they are the big two. For instance, the entire question of git reset doesn't occur in Mercurial because git reset (a) manipulates branch pointers—Mercurial has those bookmarks instead, if you choose to use them—and (b) manipulates the index that Mercurial doesn't even have.

My own answer: what's going on

Anyway, the key here is these three things. (Here comes some terminology!)

  1. In Git, a branch name is little more than a name-to-hash-ID mapping. What matters are the commits.

  2. A commit is a unique entity, identified by a unique hash ID like b5101f929789889c2e536d915698f58d5c5c6b7a, that stores, permanently[1] and unchangeably, a snapshot of files and some metadata, including the hash ID(s) of some other commit(s).

  3. The index is the area that Git actually uses to build new commits.


[1] Well, as permanent as the commit, anyway. Commits eventually go away if there's no way to find them. This is where branch names and graph theory come in, but we'll get to that later.


What to know about the index

Let's just start with this observation: when a commit stores a snapshot of all of your files, it keeps them in a compressed, read-only, Git-only storage form. They're sort of frozen or freeze-dried, as it were. No one can change them at all. That's fine for archival—saving old source code—but completely useless for getting any new work done.

To get work done, you need a place where your files are unfrozen, rehydrated, readable and writable, in their normal everyday form. That place is what Git calls your work-tree. Git could stop here—frozen commits and flexible work-tree—and that's what Mercurial does and it works fine. But for whatever reason, Git adds this thing it calls the index, or sometimes the staging area, or even the cache. (The name used depends on who / what is doing the naming, but all three are the same thing. Also, the index itself is more complicated than I'll go into, but we don't need to worry about these complications here.)

What the index stores is Git-ified copies of files. They're not exactly frozen, but they are in the same freeze-dried format. They're not useful to you; they're only useful to Git. Why Git does this is debatable, but it does, and you need to know about it. What matters is that the index is how Git makes new commits.

When you run:

git commit -m "this is a terrible log message"

Git will package up whatever is in the index right now, along with your metadata (your name and email address and the log message and so on), and turn that into a new commit. The stuff in your work-tree, where you're doing your work, is entirely irrelevant! The fact that everything is already prepared (already freeze-dried, as it were) is what makes git commit so fast. Mercurial's hg commit, which commits what's in your work-tree, has to check every file in your work-tree to see if it's the same as the previous one or not, and if not, prepare the freeze-dried form for the commit. So in a big project you run hg commit and then go out for coffee or whatever.[2] But with Git, if you change a file in the work-tree, Git makes you run:

git add file

every time. This copies the file—while freeze-drying or Git-ify-ing it—into the index.

Hence, the index always contains the next commit you're proposing to make. If you make some changes to the work-tree, and want them in your next commit, you have to explicitly copy them into the index before you run git commit. You can use git commit -a to have Git scan your work-tree and do the adds for you, making Git act the way Mercurial would if you were using Mercurial. That's certainly convenient and lets you not think about the index, or even pretend it's not there. But I think it's a bad plan because then git reset becomes inexplicable.
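Here is a small sketch of that difference, run in a throwaway repository. The scratch directory, the placeholder identity, and the file name README are all just assumptions for illustration. A plain git commit refuses to do anything with an edit that was never added, while git commit -a does the add for you first:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "one" > README
git add README
git commit -q -m "first"

echo "two" > README                        # edited in the work-tree, never added

# Plain commit only looks at the index, which still matches the first commit.
if git commit -q -m "second" >/dev/null 2>&1; then
  plain_commit="made"
else
  plain_commit="refused"
fi

# -a copies modified tracked files into the index before committing.
git commit -q -a -m "second"
committed=$(git show HEAD:README)

echo "plain commit: $plain_commit, committed content after -a: $committed"
```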


[2] It's usually not that bad, and in a small project the difference is nearly undetectable. Mercurial uses a lot of cache tricks to speed this up as much as it can, but (unlike Git) it keeps those out of the way of the user.


Commits

Now let's look closely at what, exactly, goes into a commit. I think the best way to see this is to look at an actual commit. You can look at your own with:

git cat-file -p HEAD

but I'll show this one from the Git repository for Git like this:

$ git cat-file -p b5101f929789889c2e536d915698f58d5c5c6b7a | sed 's/@/ /'
tree 3f109f9d1abd310a06dc7409176a4380f16aa5f2
parent a562a119833b7202d5c9b9069d1abb40c1f9b59a
author Junio C Hamano <gitster pobox.com> 1548795295 -0800
committer Junio C Hamano <gitster pobox.com> 1548795295 -0800

Fourth batch after 2.20

Signed-off-by: Junio C Hamano <gitster pobox.com>

Note the tree and parent lines, which refer to additional hash IDs. The tree line represents the saved source code snapshot. It might not be unique! Suppose you make a commit, then later, go back to an old version but save that as a new commit on purpose. The new commit can re-use the original commit's tree, and Git will do just that, automatically. This is one of many tricks Git has up its sleeves for compressing archived snapshots.
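You can watch the tree re-use happen in a scratch repository. This is only a sketch: the file name file.txt and the commit messages are made up for the example. The first and third commits hold identical content, so they should share one tree hash while still being different commits:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "v1" > file.txt; git add file.txt; git commit -q -m "first"
echo "v2" > file.txt; git add file.txt; git commit -q -m "second"
echo "v1" > file.txt; git add file.txt; git commit -q -m "back to v1 on purpose"

git cat-file -p HEAD                       # shows the tree and parent lines

old_tree=$(git rev-parse 'HEAD~2^{tree}')  # tree of the first commit
new_tree=$(git rev-parse 'HEAD^{tree}')    # tree of the third commit
old_commit=$(git rev-parse HEAD~2)
new_commit=$(git rev-parse HEAD)

echo "trees:   $old_tree $new_tree"
echo "commits: $old_commit $new_commit"
```

The two tree hashes match because a tree's hash depends only on the snapshot contents; the commit hashes differ because the parent and message differ.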

The parent line, though, is how Git commits become a graph. This particular commit is b5101f929789889c2e536d915698f58d5c5c6b7a. The commit that comes before this commit is a562a119833b7202d5c9b9069d1abb40c1f9b59a, which is a merge commit:

$ git cat-file -p a562a119833b7202d5c9b9069d1abb40c1f9b59a | sed 's/@/ /'
tree 9e2e07ce274b0a5a070d837c865f6844b1dc0de8
parent 7fa92ba40abbe4236226e7d91e664bbeab8c43f2
parent ad6f028f067673cadadbc2219fcb0bb864300a6c
author Junio C Hamano <gitster pobox.com> 1548794876 -0800
committer Junio C Hamano <gitster pobox.com> 1548794877 -0800

Merge branch 'it/log-format-source'

Custom userformat "log --format" learned %S atom that stands for
the tip the traversal reached the commit from, i.e. --source.

* it/log-format-source:
  log: add %S option (like --source) to log --format

This commit has two parent lines, giving two more commits. That's what makes this a merge commit in the first place.

What all this means is that if we throw out the notion of looking at the source code (we can bring it back any time by using the tree lines from each commit—every commit has one), we can view the commits themselves as just a linked series of nodes in a graph, each with its own unique hash ID, each of which remembers the hash ID of some predecessor or parent nodes.

We can draw these like this:

A <-B <-C

for a simple three-commit repository, or:

...--I--J--M--N
  \       /
   K-----L

for a more complicated repository with a merge as the parent of the last commit (on the right). We use one uppercase letter to stand in for the actual, apparently-random hash ID, because hash IDs are unwieldy (but single letters are pretty wieldy). The arrows, or connecting lines, from a child commit back to its parent(s) are the parent lines in the actual commit.

Remember, again, that all these commits are frozen in time, forever. We cannot change any aspect of any of them. We can of course make a new commit (from the index as usual). If we don't like commit C or commit N, we can make a replacement for it, e.g.:

     D
    /
A--B--C

Then we can bend C out of the way and use D instead:

A--B--D
    \
     C

These are the same graph, we're just looking at it differently.

Branch names (and other names but we won't cover them here)

These graph drawings are neat and simple and, I'll argue, the way to reason about your Git repository. They show the commits, and they hide the ugly hash IDs from us. But Git does actually need the hash IDs—that's how Git retrieves the commits—and we're going to need to remember the last hash ID of any one of these chains. The reason we only need the last one should be obvious now: if we grab hold of, say, commit D, well, commit D stores the actual hash ID of commit B inside itself. So once we know D's hash, we use D to find B. Then we use B to find A, and—since A is the very first commit and therefore has no parent—we can stop and rest.
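That backwards walk can be done by hand. The following is a rough sketch in a scratch repository with three empty commits named A, B and C; the variable names commit and walked are just for this example. It starts at the last commit and keeps asking for the parent until there isn't one:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

for name in A B C; do
  git commit -q --allow-empty -m "$name"
done

commit=$(git rev-parse HEAD)               # the last commit is all we need to know
walked=""
while [ -n "$commit" ]; do
  walked="${walked:+$walked }$(git log -1 --format=%s "$commit")"
  # The root commit has no parent, so rev-parse fails and the walk stops.
  commit=$(git rev-parse --verify -q "$commit^") || commit=""
done

echo "$walked"                             # prints: C B A
```

This is essentially what git log does for you, which is why knowing only the last hash ID is enough.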

So we need one more addition to our drawing here. What we need is a branch name. The name simply points to (i.e., contains the actual hash ID of) the last commit! We can draw this as:

A--B--D   <-- master
    \
     C

The name, master, holds the hash ID of the last commit. From there we find the previous commits. What Git stores for us is:

  • all of the commits, by hash ID
  • some set of names, each of which holds one hash ID

and that—except for all the complications with index and work-tree—is how Git works. To make a new commit E, we just snapshot the index, add the metadata (our name, email address, etc) including the hash ID of commit D, and write that into the commit database:

        E
       /
A--B--D   <-- master
    \
     C

and then have Git automatically update the name master to point to the new commit we just made:

        E   <-- master
       /
A--B--D
    \
     C

Now we can straighten out the kink:

A--B--D--E   <-- master
    \
     C

What about poor lonely commit C, though? It has no name. It has some actual big ugly hash ID, but how, without a name or memorizing that hash ID, will we ever find commit C?

The answer is that Git will eventually delete C entirely unless we give it a name. The obvious name to use is another branch name, so let's do that:

A--B--D--E   <-- master
    \
     C   <-- dev

Now we have two branches, master and dev. The name master means "commit E" and the name dev means "commit C", at the moment. As we work with the repository and add new commits to it, the hash IDs stored under these two names will change. This leads to our key observation: In Git, the commits are permanent (mostly) and unchangeable (entirely), but the branch names move. Git stores the graph—these chains of commits with their internal arrows connecting them, in this backwards-looking fashion—for us. We can add to it any time we want, by adding more commits. And, Git stores a name-to-hash-ID mapping table for us, with branch names holding the hash ID of starting points (or ending points?) in the graph.

The Git terminology for those starting / ending points is tip commit. The branch name identifies the tip commit.
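To see that a branch name really is just a movable name-to-hash-ID entry, here is a small sketch in a scratch repository. The commit messages D and E echo the drawings above; the variable names are made up. It reads the hash stored under the current branch name before and after making a commit:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

git commit -q --allow-empty -m D
branch=$(git symbolic-ref --short HEAD)    # whatever the default branch is called here
tip_before=$(git rev-parse "$branch")

git commit -q --allow-empty -m E
tip_after=$(git rev-parse "$branch")
parent_of_tip=$(git rev-parse "$branch^")

# The whole name-to-hash table for branches:
git for-each-ref --format='%(refname:short) -> %(objectname)' refs/heads

echo "before: $tip_before"
echo "after:  $tip_after (parent: $parent_of_tip)"
```

The name now holds E's hash, and E's parent is the hash the name used to hold. Commit D itself was not touched.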

HEAD, and git checkout and the index and the work-tree

Now that we have more than one branch name in our repository, we need some way to remember which branch we're using. This is the main function of the special name HEAD. In Git, we use git checkout to select some existing branch name, such as master or dev:

$ git checkout dev

results in:

A--B--D--E   <-- master
    \
     C   <-- dev (HEAD)

By attaching the name HEAD to a branch name like dev, Git knows which branch we're working on now.

As a crucial side effect, Git also:

  • copies all the files from C into the index, ready for the next commit, and
  • copies all the files from C/the-index into the work-tree, so we can see and use them.

Git may also need to remove some files, if we were on commit E and it has files that aren't there in C. It will remove them from both the index and the work-tree. As usual, Git makes sure that all three copies of every file match up. If there is a file named README in commit C, for instance, we have:

  • HEAD:README: this is the frozen Git-ified copy in commit C, now accessible under the special name HEAD.
  • :README: this is the index copy. It matches the HEAD:README at the moment, but we can overwrite it with git add.
  • README: this is a regular file. We can work with it. Git doesn't really care very much about that—we'll need to copy it back into :README if we change it!
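The three copies can be made to disagree on purpose, which makes them easier to tell apart. In this sketch (scratch repository, made-up file contents), the committed copy, the index copy and the work-tree copy of README each hold a different word:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "committed" > README
git add README
git commit -q -m C

echo "staged" > README
git add README                             # overwrite the index copy
echo "edited" > README                     # change only the work-tree copy

head_copy=$(git show HEAD:README)          # frozen copy in the current commit
index_copy=$(git show :README)             # copy that the next commit would use
worktree_copy=$(cat README)                # ordinary file on disk

echo "HEAD:README = $head_copy"
echo ":README     = $index_copy"
echo "README      = $worktree_copy"
```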

So, with one action—git checkout master or git checkout dev—we:

  • re-attach HEAD;
  • fill the index; and
  • fill the work-tree

and are now ready to work, git add files to copy them back into the index, and git commit to make a new snapshot that adds to the branch and makes the branch name refer to the new commit. Let's make a new commit F on dev:

... edit some file(s) including README ...
git add README                    # or git add ., or git add -u, etc
git commit -m "another terrible log message"

and now we'll have:

A--B--D--E   <-- master
    \
     C--F   <-- dev (HEAD)

Git knows to update dev, not master, because HEAD is attached to dev, not master. Note, too, that since we made commit F from whatever is in our index right now, and we just made the index match the work-tree, now F, the index, and the work-tree all match up. That's just what we'd have if we just now ran git checkout dev!
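A quick way to check which name HEAD is attached to is git symbolic-ref. The sketch below assumes a scratch repository with a single starting commit C and a second branch called dev, as in the drawings; it then commits F and confirms that only dev moved:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

git commit -q --allow-empty -m C
first_branch=$(git symbolic-ref --short HEAD)   # stands in for master

git branch dev                             # new name, same commit
git checkout -q dev                        # attach HEAD to dev
attached_to=$(git symbolic-ref HEAD)

git commit -q --allow-empty -m F           # moves dev, because HEAD is attached to it

dev_tip=$(git log -1 --format=%s dev --)
first_tip=$(git log -1 --format=%s "$first_branch" --)
echo "HEAD -> $attached_to; dev is at $dev_tip; $first_branch is at $first_tip"
```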

This is where git reset comes in

Except for the special case of an unreachable commit that eventually gets deleted, the graph itself can only be added-to. The branch names, however, we can move around any time we like. The main command for doing this is git reset.

Suppose, for instance, that commit F is awful—it's a mistake, we just want to forget it entirely. What we need to do is move the name dev so that instead of pointing to F, it points to C again—F's parent.

We can find the hash ID of commit C and, rudely, just write that directly into the branch name. But if we do that, what about our index and work-tree? They'll still match the contents of commit F. We'll have the graph:

A--B--D--E   <-- master
    \
     C   <-- dev (HEAD)
      \
       F

but the index and work-tree won't match C. If we run git commit again we'll get a commit that looks almost exactly the same as F—it will share the tree, and just have a different date stamp and maybe a better log message. But maybe that's what we want! Maybe we wanted to just fix our terrible log message. In that case, making a new G from the current index would be the answer.

That's what git reset --soft does: it lets us move the branch name to point to a different commit, without changing the index and work-tree. We discard F, then make a new G that's just like F but has the right message. F has no name and eventually withers away.

But what if we just wanted to get rid of F entirely? Then we'd want the index and work-tree to match commit C. We'll let F wither away as before. But to get the index and work-tree to match C, we need git reset --hard.

Because the index and work-tree are separate entities, we can choose to go halfway. We can move the name dev to point to C, replace the index contents with those from C, but leave the work-tree alone. That's what git reset --mixed does, and git reset --mixed is actually the default for git reset so we don't even need the --mixed part.

All three of these actions have different end-goals: git reset --soft was for "redo the commit", git reset --hard was for "throw away the commit entirely", and git reset --mixed doesn't have a clear usage in this particular example. So why are they all spelled git reset? That's where your rant applies again: they probably shouldn't be. They're related in that Git has these three things it can do with branch-name-to-commit-hash, and index and work-tree contents:

  1. move the branch name
  2. replace or keep the index contents
  3. replace or keep the work-tree contents

and git reset will either do step 1 and stop (git reset --soft), or do steps 1 and 2 and stop (git reset --mixed / the default), or do all three and stop (git reset --hard). But their purposes aren't related: Git is confusing mechanism ("how we get from here to there") with goal ("get to there").
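The three stopping points can be compared side by side. This is a sketch with a hypothetical helper, try_reset, that builds a fresh two-commit scratch repository (C then F, each writing a different line into README), resets back one commit with the given mode, and reports where the branch, the index and the work-tree ended up:

```shell
set -e

# Hypothetical helper: build C--F, reset to C with the given mode, report state.
try_reset() {
  mode=$1
  repo=$(mktemp -d)                        # throwaway scratch repository
  (
    cd "$repo"
    git init -q
    git config user.name "Example User"    # placeholder identity
    git config user.email "user@example.com"
    echo "from C" > README; git add README; git commit -q -m C
    echo "from F" > README; git add README; git commit -q -m F
    git reset -q "$mode" HEAD^
    printf '%s: tip=%s index=%s work-tree=%s\n' "$mode" \
      "$(git log -1 --format=%s)" "$(git show :README)" "$(cat README)"
  )
}

soft=$(try_reset --soft)
mixed=$(try_reset --mixed)
hard=$(try_reset --hard)
printf '%s\n' "$soft" "$mixed" "$hard"
# --soft: tip=C index=from F work-tree=from F
# --mixed: tip=C index=from C work-tree=from F
# --hard: tip=C index=from C work-tree=from C
```

In all three cases the branch name moves to C; the modes differ only in how far the replacement continues into the index and work-tree.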

Conclusion

Say the history of my commits is A - B - C and I have only this branch.

OK:

A--B--C   <-- branch (HEAD)

I need to go back to B, but I also want to retain the code I wrote in C

OK. Clearly what we'll want is one name identifying commit B and another one identifying commit C. But we also need to concern ourselves with the index and work-tree!

There's only one index and one work-tree,[3] and those aren't copied by git clone. Only the commits are permanent. So if you have anything unsaved in your index and/or your work-tree, you probably should save it now. (By committing, probably; you can also use git stash to make commits that aren't on any branch, but let's not go there, at least not yet.) Let's assume you don't, so as to remove the question entirely.

The graph won't change. You just need to add a new name. There are lots of ways to do that, but for illustration, let's do it this way: let's start by creating a new branch name that also points to commit C, which we'll call save. To do that, we'll use git branch, which can create new names pointing to existing commits:

$ git branch save

The default for where the new name points is to use the current commit (via HEAD and the current branch name), so now we have:

A--B--C   <-- branch (HEAD), save

HEAD has not moved: it's still attached to branch, which still points to C. Note that both branches identify the same commit C, and all three commits are on both branches.[4]

Now that we have the name save saving the hash ID of C, we're free to move the name branch to point to commit B. To do that, we'll use git reset. We'd like to have our index and work-tree match commit B too, so we want git reset --hard—which will replace our index and work-tree, which is why it was important to make sure we didn't need to save anything from them:

$ git reset --hard <hash-of-B>

giving:

A--B   <-- branch (HEAD)
    \
     C   <-- save
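Put together, the whole sequence looks roughly like this in a scratch repository. The file name code.txt is made up, and HEAD~1 stands in for the hash of B so that nothing has to be pasted from git log:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Build A--B--C, each commit writing its own letter into code.txt.
for name in A B C; do
  echo "$name" > code.txt
  git add code.txt
  git commit -q -m "$name"
done

git branch save                            # second name for commit C
git reset -q --hard HEAD~1                 # move the current branch, index, work-tree to B

current_tip=$(git log -1 --format=%s)
saved_tip=$(git log -1 --format=%s save --)
echo "current branch: $current_tip (code.txt says $(cat code.txt))"
echo "save:           $saved_tip (code.txt says $(git show save:code.txt))"
```

Getting the work from C back later is then a matter of git checkout save, or git merge save once it has been fixed.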

There are, of course, a lot of other options. For instance, we could leave branch pointing to C and create a new name pointing to B:

A--B   <-- start-over
    \
     C   <-- branch (HEAD)

and to do that we could use:

$ git branch start-over <hash-of-B>

Since we didn't move HEAD, there is no need to disturb the index and work-tree in any way. If we had uncommitted work we could now run git add if needed (to update the index if needed) and git commit to make a new commit D that would have C as its parent.
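A sketch of this variant, again in a scratch repository with a made-up file code.txt and with HEAD~1 standing in for the hash of B. Nothing about the current branch, index or work-tree changes; only a new name appears:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

for name in A B C; do
  echo "$name" > code.txt
  git add code.txt
  git commit -q -m "$name"
done

git branch start-over HEAD~1               # new name pointing at B; HEAD stays put

current_tip=$(git log -1 --format=%s)
start_over_tip=$(git log -1 --format=%s start-over --)
pending=$(git status --porcelain)          # empty: index and work-tree untouched
echo "still on $current_tip, start-over is at $start_over_tip"
```

To actually work from B you would then run git checkout start-over.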


[3] This is actually not true. There's one main work-tree and it has one main index. You can create as many temporary index files as you like, and since Git 2.5, you can add auxiliary work-trees whenever you like. Each added work-tree has its own separate index (the index indexes / caches the work-tree, after all) and its own HEAD so that each can, and in fact must, be on a different branch. But again, that's not something you need to worry about yet. Creating a temporary index is really just for special-purpose actions: for instance, that's how git stash commits your current work-tree without messing with other things.

[4] This is where Git and Mercurial differ enormously: in Mercurial, every commit is on exactly one branch, where it remains forever. You literally can't make two branch names that identify the same commit. Mercurial also doesn't use this "branch name equals tip commit, and other commits are implied by walking the graph" trick.


There's a trick for hash IDs

I'm just going to mention this in passing here. Above, we had a lot of cases where you probably have to run git log and cut-and-paste big ugly hash IDs. We already know that a name, like a branch name, lets us use the name instead of the ID. Instead of writing out the hash ID of C as pointed-to by branch or by save, we can just use the name:

git show save

for instance will extract commit C, then extract commit B, compare the two, and show us what's different in the snapshots in B and C. But we can go one better:

git show save~1

means: Find commit C. Then, step back one parent link. That's commit B. So git show will now extract the snapshots in B and its parent A, compare the two, and show us what we changed in B. The tilde ~ and hat ^ characters can be used as suffixes on any revision specifier. The complete description of how to specify revisions (commits or commit ranges, mostly) is documented in the gitrevisions manual. There are a lot of ways to do it!
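A short sketch of the suffixes in action, using a scratch repository with three empty commits A, B and C and a branch name save at C. The variable names are made up for the example:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

for name in A B C; do
  git commit -q --allow-empty -m "$name"
done
git branch save                            # save names commit C

via_name=$(git log -1 --format=%s save --)
via_tilde=$(git log -1 --format=%s "save~1" --)   # one parent link back
via_hat=$(git log -1 --format=%s "save^" --)      # first parent, same commit here
via_tilde2=$(git log -1 --format=%s "save~2" --)  # two parent links back

echo "save=$via_name save~1=$via_tilde save^=$via_hat save~2=$via_tilde2"
# save=C save~1=B save^=B save~2=A
```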


Some years ago, I tried my hand at starting a book that would use both Git and Mercurial as a way to get people started with graph-and-hash-based distributed source code control. Unfortunately most of the work on that happened between jobs, and I haven't been between-jobs for years now, so it's stalled and getting stale. But for those who want to see what's there, it's here.



Answer 2:

If you want to keep the commit C as a commit and work from B again, you will need a new branch. I would suggest doing a new branch from C and hard-resetting master (or whatever your main working branch is) to B.

You will then be left with this (I added a new commit for clarity):

D      master
| C    review-c branch
|/
B
|
A

A soft reset to B will move the branch back to B, so that commit C is no longer on it, and leave the changes you made in C staged (the same as if you had made the changes for C yourself and run git add on them).



Answer 3:

In your case, I suggest that 1) you create a branch or a tag at C so that you can review it when you want to, and 2) reset your branch to B so that the changes of C are excluded from the branch.

# create a branch at C (C and B stand for the commits' hash IDs)
git branch foo C

# create a tag at C
git tag bar C

# reset your branch to B
git checkout <your_branch>
git reset B --hard

# review C
git show foo
git show bar

To some degree, a tag is more stable than a branch. foo may move to another commit by an accidental command. bar always points at C unless you deliberately move it to another commit. bar can be considered an alias of C.
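The difference is easy to see in a scratch repository. In this sketch (empty commits B and C, the names foo and bar as above, and a made-up "accidental commit"), committing while on foo drags the branch along, while the tag stays where it was put:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

for name in B C; do
  git commit -q --allow-empty -m "$name"
done

git branch foo                             # branch at C
git tag bar                                # lightweight tag at C

git checkout -q foo
git commit -q --allow-empty -m "accidental commit"

foo_tip=$(git log -1 --format=%s foo --)
bar_target=$(git log -1 --format=%s bar --)
echo "foo now points at: $foo_tip"
echo "bar still points at: $bar_target"
```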



Answer 4:

<rant-addressing on>

You have awoken the Ancients. Be very afraid.

<rant-addressing off>

;-)


For the problem at hand, if you want to keep changes in C for later but keep on working from the B state which is better, you have a number of ways to do that. I'd do a new branch* and reset the old one.

branch creation / hard reset method

# create your backup branch for these failed changes
git checkout -b to-be-reviewed

# take your branch to its previous state
git checkout -
git reset --hard HEAD^

But you could also, as you considered, use git reset --soft. If you want to compare uses, here's how you could have proceeded with it:

branch creation / soft reset method

# undo last commit (C) but keep the changes in the working tree
git reset --soft HEAD^

# create a new branch and commit on it
git checkout -b to-be-reviewed
git commit -m "Your message"

Then your original branch points at B and to-be-reviewed has all recent (though non-working) changes.


Lastly, this is a use-case for git stash:

no new branch / stash method

# reset your branch to state B
git reset --soft HEAD^

# stash your changes with a title for easier reuse
# (git stash push -m is the current spelling of the deprecated git stash save)
git stash push -m "Failed changes XYZ"

And at this point you can later inspect this stash with git stash list / git stash show.
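For completeness, here is a rough end-to-end run of the stash method in a scratch repository. The file name code.txt is made up, and the sketch uses git stash push -m, which newer Git versions prefer over git stash save. After the stash, the branch and work-tree are back at B and the content from C lives in the stash entry:

```shell
set -e
cd "$(mktemp -d)"                          # throwaway scratch repository
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "B" > code.txt; git add code.txt; git commit -q -m B
echo "C" > code.txt; git add code.txt; git commit -q -m C

git reset -q --soft HEAD^                  # branch back at B, C's changes still staged
git stash push -q -m "Failed changes XYZ"  # move those changes into a stash entry

git stash list                             # shows the entry with its title
tip=$(git log -1 --format=%s)
on_disk=$(cat code.txt)
stashed=$(git show 'stash@{0}:code.txt')   # the work-tree copy saved in the stash
echo "branch at $tip, work-tree has $on_disk, stash has $stashed"

# Later, to bring the changes back: git stash apply   (or git stash pop)
```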

* (As ElpieKay suggests very appropriately, tags could be considered instead of branches for this use. The overall reasoning is the same when it comes to resets anyway.)