Git Filter-Branch All command

Posted 2020-07-14 05:49

Question:

I'm currently using the command "git filter-branch --subdirectory-filter MY_DIRECTORY -- --all" to extract one directory from all 30 branches in this Git repo. Before running it, I check out every branch to make sure the --all option works properly.

My question is: do I have to check out every branch before running filter-branch with --all, or will it still work without checking out all 30 branches? Each branch is almost 3GB, so checking them all out takes a very long time. Any clarification would be great!

Answer 1:

Before we start

Before I dive into the answer itself, note that if you want to have a local branch name for each of your remote-tracking names, you can simply create that local branch name without using git checkout:

git branch -t develop origin/develop
git branch -t feature/X origin/feature/X
git branch -t foo origin/foo

and so on. This is a subset of what git checkout does, and is very fast since creating new branch names just means writing one file.
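With 30 branches, typing one git branch -t line per branch gets tedious, so a loop may help. The following is only a sketch: it builds a throwaway upstream and clone under a temporary directory (the demo branch names develop and foo and the demo identity are placeholders, not taken from the question) and assumes the remote is called origin.

```shell
#!/bin/sh
set -e
# Throwaway demo setup: an "upstream" repository with three branches, plus a clone.
tmp=$(mktemp -d)
up="$tmp/upstream"
clone="$tmp/clone"
git init -q "$up"
git -C "$up" symbolic-ref HEAD refs/heads/master   # fix the initial branch name
git -C "$up" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$up" branch develop
git -C "$up" branch foo
git clone -q "$up" "$clone"

# The technique itself: one local branch per origin/* name, with no checkout.
git -C "$clone" for-each-ref --format='%(refname)' refs/remotes/origin |
while read -r ref; do
    name=${ref#refs/remotes/origin/}
    [ "$name" = HEAD ] && continue      # origin/HEAD is a symbolic ref, skip it
    # Skip names that already exist locally (the branch that clone checked out).
    git -C "$clone" show-ref --verify --quiet "refs/heads/$name" && continue
    git -C "$clone" branch -q -t "$name" "origin/$name"
done

branches=$(git -C "$clone" for-each-ref --format='%(refname:short)' refs/heads)
echo "$branches"
```

In a real repository only the loop is needed, run from inside the clone (drop the -C "$clone" arguments).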

(If you like, you can use this technique and stop here, but the rest of this answer should be quite useful.)

The short and long answer

The short answer is that you do not have to check out (or create new) branch names. But you will need to understand more than that to use Git (including this particular git filter-branch operation) well.

Let's start with this: --all here means all references. But what's a "reference" then?

Well, any branch name is a reference. But so is any tag name. The special name refs/stash, used by git stash, is a reference. Remote-tracking names are references. Notes refs (from git notes) are references. For more about this and other Git terms, see the gitglossary (note that this particular entry is under ref rather than reference).
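To see which references a repository actually has, git for-each-ref lists them by full name. The sketch below uses a throwaway repository; the tag name v1.0 is a placeholder, and the remote-tracking name is created by hand with git update-ref purely as a stand-in for what git fetch or git clone would normally create.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

git tag v1.0                                     # a tag name is a reference
git update-ref refs/remotes/origin/master HEAD   # stand-in for a remote-tracking name

# Every line printed here is a full reference name; --all hands all of them
# to filter-branch.
refs=$(git for-each-ref --format='%(refname)')
echo "$refs"
```

The full names show the namespaces involved: refs/heads/ for branches, refs/remotes/ for remote-tracking names, and refs/tags/ for tags.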

When you first use git clone to clone a repository, you are telling your own Git: make a new, independent copy of some existing repository, at the URL I give you, so that I can do my own work and then share it or not as I please. But their repository—whoever "they" are at the URL—has its own branch names. They have their master, which is not always going to be the same as your master. So your Git renames their names: their master becomes your origin/master, and so on. These remote-tracking names are references.

After git clone finishes copying all of their commits into your repository and renaming all of their branch names to your remote-tracking names, its last step is to check out a branch. But you don't have any branches yet. This is where a special trick of git checkout comes in: if you ask Git to check out, by name, a branch that doesn't exist, Git looks through all of your remote-tracking names. If exactly one of those matches, Git will create a local branch name (a new reference) that points to the same commit as this remote-tracking name.
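The checkout trick can be observed in a small experiment. This is a sketch against a throwaway upstream and clone; the branch name develop and the demo identity are placeholders, and it assumes a single remote named origin.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$tmp/upstream" branch develop
git clone -q "$tmp/upstream" "$tmp/clone"
cd "$tmp/clone"

# Right after cloning, only the branch that clone checked out exists locally.
before=$(git for-each-ref --format='%(refname:short)' refs/heads)

# There is no local "develop" yet, so checkout creates it from origin/develop.
git checkout -q develop
after=$(git for-each-ref --format='%(refname:short)' refs/heads)
upstream=$(git rev-parse --abbrev-ref 'develop@{upstream}')

echo "before: $before"
echo "after: $after"
echo "develop tracks: $upstream"
```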

Hence, your repository has some series of commits, all of which link to each other in a backwards fashion:

first  <--next ... <--almost-last  <--last

(if they're all linear, which they almost never are) which we can draw as:

A--B--...--H--I

where each uppercase letter represents a commit. A set of commits with some "branchy-ness" (branchiness?) might look like:

     C--D
    /
A--B
    \
     E--F--G

and if there are merge commits, which point backwards to two previous commits instead of just one, it will be even more complicated.

The names we care most about here—branch names and remote-tracking names in particular—serve as a way for Git to find the last commit:

...--H--I   <-- origin/master

The name origin/master is said to point to commit I. When your Git creates your own master, your master now also points to I:

...--H--I   <-- master, origin/master

If you create your own new commit on master, this is what happens:

...--H--I   <-- origin/master
         \
          J   <-- master

Git makes up a new ID for the new commit—it's some apparently-random big ugly hash ID, but here we just call it J—and then changes your name master to point to this new commit.

If you run git fetch and bring in new commits from origin and they have updated their master, you now get:

...--H--I--K   <-- origin/master
         \
          J   <-- master

and now your master and their origin/master have diverged.

These names, master and origin/master, have the important effect of making their commits reachable. That is, by following the arrow from each name, Git can find commits J and K. Then, using the backwards arrow—really the commit's parent commit hash ID—from J to I or from K to I, Git can find commit I. Using the backwards arrow from I itself, Git can find H, and so on, all the way back to the very first commit, where the action stops.

All unreachable commits—those not found by starting at all of these starting (ending?) points and walking backwards—will be removed at some point, so they effectively don't exist. For the purposes of most Git commands that walk through the graph, that's the case as well. (There are some special purpose recovery tricks that let you get deleted commits back for 30 days, but filter-branch does not honor these.)
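Reachability can be checked with git rev-list. In this sketch (throwaway repository, demo identity as a placeholder), resetting master back by one commit leaves commit J unreachable from any name, although the reflog still remembers it.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

# Move master back one commit: no name points to J any more.
git reset -q --hard HEAD~1

reachable=$(git rev-list --count --all)             # start from refs and HEAD only
with_reflog=$(git rev-list --count --all --reflog)  # also start from reflog entries
echo "reachable from names: $reachable"
echo "including reflog entries: $with_reflog"
```

The first count is what a graph walk such as filter-branch with --all sees; the second includes the commit that only the reflog still protects.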

What all this means for filter-branch

The job of git filter-branch is to copy commits. It walks through the graph, using the starting (ending?) points you give it to find all reachable commits. It saves their hash IDs in a temporary file. Then, going in the opposite direction—i.e., forwards in time instead of Git's usual backwards—it extracts each of these commits. That is, it checks it out, so that all the files in that snapshot are available. Then filter-branch applies the filter(s), and then makes a new commit from the resulting files. So if your filter makes a simple change, the result is a copy of the original graph:

A--B--C------G--H   <-- master, origin/master
    \       /
     D--E--F

becomes:

A'-B'-C'-----G'-H'  <-- master, origin/master
    \       /
     D'-E'-F'

What happens to the original commits? Well, they are still there: what filter-branch does with the names that found them is to rename them, using refs/original/ in front of their internal full names:

A--B--C------G--H   <-- refs/original/refs/heads/master, refs/original/refs/remotes/origin/master
    \       /
     D--E--F

One reason filter-branch has so many filter options is that this process is dreadfully slow. It takes a long time to extract every file into a temporary directory. So some filters can work without extracting the files at all, which goes much (much!) faster.

Another reason is that sometimes we don't want to copy every commit, we only want to copy some commits that meet some criteria. That's the case for the --subdirectory-filter: it only copies a commit if it changes files (with respect to its parent commit(s)) that involve the subdirectory in question. So it can skip extracting a lot of commits, in some cases. Of course, the subdirectory filter also renames the files along the way, as it extracts-and-recommits, to remove the subdirectory path. The result is that a larger commit graph is copied to a newer, smaller one:

A--B--C------G--H   <-- master
    \       /
     D--E--F

might become:

B'--G'--H'   <-- master
 \ /
  E'

The retained refs/original/refs/heads/master will still point to commit H, while the rewritten refs/heads/master will point to copied commit H'. Note that the first commit in the new graph is B', and there is no A' at all, since A did not have the subdirectory in question.
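The effect is easy to reproduce on a tiny repository. The following sketch is not the asker's setup: the directory name sub, the file names, and the commit messages are all made up for illustration. Only two of the four commits touch sub/, so only those two should be copied.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and delay in newer Git
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master

# Four commits; only the second and fourth touch sub/.
echo top > README;        git add README;       git commit -q -m "A: top-level only"
mkdir sub
echo one > sub/file.txt;  git add sub/file.txt; git commit -q -m "B: add sub/file.txt"
echo more >> README;      git add README;       git commit -q -m "C: top-level only"
echo two >> sub/file.txt; git add sub/file.txt; git commit -q -m "D: change sub/file.txt"

git filter-branch --subdirectory-filter sub -- --all >/dev/null

new_count=$(git rev-list --count master)
old_count=$(git rev-list --count refs/original/refs/heads/master)
new_files=$(git ls-tree -r --name-only master)
echo "rewritten master: $new_count commits, files: $new_files"
echo "refs/original/refs/heads/master: $old_count commits"
```

The rewritten master holds only the copies of the commits that touched sub/, with file.txt moved to the top level, while the refs/original/ name still reaches all four original commits.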

There's also a very important side question here: Which reference(s) does filter-branch update after it finishes all the commit-copying? The answer is in the documentation:

The command will only rewrite the positive refs mentioned in the command line (e.g. if you pass a..b, only b will be rewritten).

Since you are using --all, this will rewrite every reference, including all the origin/* remote-tracking names, whether or not you have created a local branch for them. (--all counts as a positive mention of every ref here. There is some extra trickiness with tags: if you want to rewrite your tags, add --tag-name-filter cat as a filter.)

Summary

After your filter-branch operation, you have a series of refs/original/* names that point to the original (pre-filtering) commits, renamed from their original full names. You have a series of new updated references, including all your branch names (refs/heads/*) and remote-tracking names (refs/remotes/*) pointing to the last of whichever commits got copied.

The filtered repository will be bigger than the original, because it contains the original commits plus the copied ones. To shrink it, see the "Checklist for shrinking a repository" section near the end of the git filter-branch documentation. Note, though, that if you use git clone to copy the filtered repository, the clone copies only your branch names, not your remote-tracking names. So if you have not already created a branch for each remote-tracking name, do that before cloning.
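The in-place variant of that checklist (remove the refs/original/ names, expire the reflogs, garbage-collect) can be sketched as follows. The repository, the directory name sub, and the file names are made up for the demonstration; on a real repository, run the three cleanup steps only once you are sure you no longer need the pre-filter history.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and delay in newer Git
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master
echo top > README;       git add README;       git commit -q -m "top-level only"
mkdir sub
echo one > sub/file.txt; git add sub/file.txt; git commit -q -m "add sub/file.txt"
old=$(git rev-parse HEAD)                # tip of the pre-filter history
git filter-branch --subdirectory-filter sub -- --all >/dev/null

# 1. Drop the backup names that filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original |
while read -r ref; do
    git update-ref -d "$ref"
done
# 2. Expire every reflog entry, then 3. collect unreferenced objects right away.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$old" 2>/dev/null; then old_present=yes; else old_present=no; fi
echo "pre-filter tip still in the object database: $old_present"
```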

Alternatively, you can keep the filtered repository in place after removing all the names in the refs/original/ namespace. You can then run git checkout develop to create your own refs/heads/develop from your (filtered) refs/remotes/origin/develop, and so on. All you are doing is creating new names and then checking out that particular commit, so that it's in your index and work-tree; the commits themselves are what Git really cares about, and they're already referenced by the rewritten remote-tracking names. (The git branch -t commands shown at the beginning create the names without copying the commits to index and work-tree.)