How to filter history based on gitignore?

Posted 2019-02-11 00:57

Question:

To be clear, I am not asking how to remove a single file from history, as in the question "Completely remove file from all Git repository commit history". Nor am I asking how to untrack files listed in .gitignore, as in the question "Ignore files that have already been committed to a Git repository".

I am asking about updating a .gitignore file and then removing everything that matches it from history, more or less like the question "Ignore files that have already been committed to a Git repository". Unfortunately, the answer to that question does not work for this purpose, so I am elaborating here in the hope of finding an answer that does not require a human to look through the entire source tree and manually run filter-branch on each matched file.

Here is a test script that currently performs the procedure from the answer to "Ignore files that have already been committed to a Git repository". It removes and recreates a folder named root under PWD, so be careful before running it. I describe my goal after the code.

#!/bin/bash -e

TESTROOT=${PWD}
GREEN="\e[32m"
RESET="\e[39m"

rm -rf root
mkdir -v root
pushd root

mkdir -v repo
pushd repo
git init

touch a b c x 
mkdir -v main
touch main/{a,x,y,z}

# Initial commit
git add .
git commit -m "Initial Commit"
echo -e "${GREEN}Contents of first commit${RESET}"
git ls-files | tee ../00-Initial.txt

# Add another commit just for demo
touch d e f y z main/{b,c}
## Make some other changes
echo "Test" | tee a | tee b | tee c | tee x | tee main/a > main/x
git add .
git commit -m "Some edits"

echo -e "${GREEN}Contents of second commit${RESET}"
git ls-files | tee ../01-Changed.txt

# Now I want to ignore all 'a' and 'b', and all 'main/x', but not 'main/b'
## Checkout the root commit
git checkout -b temp $(git rev-list HEAD | tail -1)
## Add .gitignores
echo "a" >> .gitignore
echo "b" >> .gitignore
echo "x" >> main/.gitignore
echo "!b" >> main/.gitignore
git add .
git commit --amend -m "Initial Commit (2)"
## --v Not sure if it is correct
git rebase --onto temp master
git checkout master
## --v Now, why should I delete this branch?
git branch -D temp
echo -e "${GREEN}Contents after rebase${RESET}"
git ls-files | tee ../02-Rebased.txt

# Supposedly, rewrite history
git filter-branch --tree-filter 'git clean -f -X' -- --all
echo -e "${GREEN}Contents after filter-branch${RESET}"
git ls-files | tee ../03-Rewritten.txt

echo "History of 'a'"
git log -p a

popd # repo

popd # root

This code creates a repository, adds some files, makes some edits, and performs the cleaning procedure. It also generates some log files. Ideally, I would like a, b, and main/x to disappear from history, while main/b stays. However, right now nothing is removed from history. What should be modified to achieve this goal?

Bonus points if this can be done on multiple branches. But for now, keep it to a single master branch.

Answer 1:

Achieving the result you want is a bit tricky. The simplest way, using git filter-branch with a --tree-filter, will be very slow. Edit: I've modified your example script to do this; see the end of this answer.

First, let's note one constraint: you can never change any existing commit. All you can do is make new commits that look a lot like the old ones, but "new and improved". You then direct Git to stop looking at the old commits, and look only at the new ones. This is what we will do here. (Then, if required, you can force Git to really forget the old commits. The easiest way is to re-clone the clone.)
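As a small illustration of this constraint, the sketch below amends a commit in a throwaway repository and shows that the result is a different commit, while the old one still exists in the object database. The directory, identity, and file names are placeholders chosen for the demo, not part of the question's script.

```shell
#!/bin/bash
set -e

# Throwaway repository; every name here is a placeholder for the demo.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo one > file
git add file
git commit -q -m "Original"
old=$(git rev-parse HEAD)

# "Changing" the commit really creates a new commit with a new hash.
git commit -q --amend -m "New and improved"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"

# The old commit is still in the object database; only the branch moved.
git cat-file -t "$old"
```

The branch name now points at the new commit, which is exactly what filter-branch does on a larger scale for every commit it copies.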

Now, to re-commit every commit that is reachable from one or more branch and/or tag names, preserving everything except that which we explicitly tell it to change,[1] we can use git filter-branch. The filter-branch command has a rather dizzying array of filtering options, most of which are meant to make it go faster, because copying every commit is pretty slow. If there are just a few hundred commits in a repository, with a few dozens or hundreds of files each, it's not so bad; but if there are about 100k commits holding about 100k files each, that's ten thousand million files (10,000,000,000 files) to examine and re-commit. It is going to take a while.

Unfortunately there is no easy and convenient way to speed this up. The best way to speed it up would be to use an --index-filter, but there is no built in index filter command that will do what you want. The easiest filter to use is --tree-filter, which is also the slowest one there is. You might want to experiment with writing your own index filter, perhaps in shell script or perhaps in another language you prefer (you will still need to invoke git update-index either way).


[1] Signed annotated tags cannot be preserved intact, so their signatures will be stripped. Signed commits may have their signatures become invalid if the commit hash changes. Whether it changes depends on whether it must: the hash ID of a commit is the checksum of the commit's contents, so if the set of files changes, the checksum changes; and if the checksum of a parent commit changes, the checksum of this commit changes too.


Using --tree-filter

When you use git filter-branch with --tree-filter, what the filter-branch code does is to extract each commit, one at a time, into a temporary directory. This temporary directory has no .git directory and is not where you are running git filter-branch (it's actually in a subdirectory of the .git directory unless you use the -d option to redirect Git to, say, a memory filesystem, which is a good idea for speeding it up).

After extracting the entire commit into this temporary directory, Git runs your tree-filter. Once your tree-filter finishes, Git packages up everything in that temporary directory into the new commit. Whatever you leave there, is in. Whatever you add to there, is added. Whatever you modify there, is modified. Whatever you remove from there, is no longer in the new commit.

Note that a .gitignore file in this temporary directory has no effect on what will be committed (but the .gitignore file itself will be committed, since whatever is in the temporary directory becomes the new copy-commit). So if you want to be sure that a file of some known path is not committed, simply rm -f known/path/to/file.ext. If the file was in the temporary directory, it is now gone. If not, nothing happens and all is well.
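A minimal sketch of that rule in action, assuming a throwaway repository with placeholder file names (keep.txt and unwanted.log are invented for the demo). FILTER_BRANCH_SQUELCH_WARNING only silences the start-up notice that recent Git versions print.

```shell
#!/bin/bash
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo data   > keep.txt
echo secret > unwanted.log     # placeholder for a file that should vanish
git add .
git commit -q -m "First"
echo more >> keep.txt
git commit -q -am "Second"

# The tree filter runs once per commit inside the temporary directory.
# Whatever is left there afterwards becomes the new commit.
git filter-branch --tree-filter 'rm -f unwanted.log' -- --all

# No commit reachable from HEAD touches the file any more.
git log --oneline HEAD -- unwanted.log
git ls-tree -r --name-only HEAD
```

The old commits remain reachable through refs/original/ until that backup ref is deleted or the repository is re-cloned.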

Hence, a workable tree filter would be:

rm -f $(cat /tmp/files-to-remove)

(assuming no white space issues in file names; to avoid white space issues, feed the list to xargs instead, as in xargs -0 rm -f < /tmp/files-to-remove, with whatever encoding you like for the xargs input; -z style NUL-terminated encoding is ideal since \0 is forbidden in path names).
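To make the white space point concrete, here is a sketch that builds a NUL-terminated list with printf and hands it to xargs -0 on standard input. The file names and the temporary list file are placeholders standing in for /tmp/files-to-remove.

```shell
#!/bin/bash
set -e

work=$(mktemp -d)
list=$(mktemp)                 # stand-in for /tmp/files-to-remove
cd "$work"
touch "plain.txt" "name with spaces.txt" "keep me.txt"

# One NUL-terminated entry per path; a missing path is harmless with rm -f.
printf '%s\0' "plain.txt" "name with spaces.txt" "not-present.txt" > "$list"

# xargs reads the list on stdin and appends the paths to the rm command.
xargs -0 rm -f < "$list"

ls
```

As a tree filter this would be passed as --tree-filter 'xargs -0 rm -f < /tmp/files-to-remove', using an absolute path because the filter runs in the temporary directory.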

Converting this to an index filter

Using an index filter lets Git skip the extract-and-examine phases. If you had a fixed "remove" list in the right form, it would be easy to use.

Let's say you have the file names in /tmp/files-to-remove in a form that is suitable for xargs -0. Your index filter might then read, in its entirety:

xargs -0 git rm --cached -f --ignore-unmatch < /tmp/files-to-remove

which is basically the same as the rm -f above, but works within the temporary index Git uses for each commit-to-be-copied. (Add -q to the git rm --cached to make it quiet.)
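Put together, an index-filter run might look like the sketch below, with xargs reading the list on standard input. The repository contents and the REMOVE_LIST variable are assumptions made for the demo; the variable is exported so that the filter, which runs in a temporary directory, can find the list by absolute path.

```shell
#!/bin/bash
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
list=$(mktemp)                      # stand-in for /tmp/files-to-remove
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

mkdir main
echo 1 > a
echo 2 > "main/x y"                 # a path containing a space
echo 3 > keep
git add .
git commit -q -m "First"
echo 4 >> keep
git commit -q -am "Second"

# NUL-terminated remove list, as xargs -0 expects.
printf '%s\0' "a" "main/x y" > "$list"
export REMOVE_LIST="$list"          # hypothetical variable name

# The filter edits only the temporary index; no files are checked out.
git filter-branch --index-filter '
    xargs -0 git rm --cached -q -f --ignore-unmatch -- < "$REMOVE_LIST"
' -- --all

git ls-tree -r --name-only HEAD
```

Because nothing is extracted to disk for each commit, this is usually much faster than the equivalent tree filter.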

Applying .gitignore files in a tree filter

Your example script tries to use a --tree-filter after rebasing onto an initial commit that has the desired items:

git filter-branch --tree-filter 'git clean -f -X' -- --all

There is one initial bug though (the git rebase is wrong):

-git rebase --onto temp master
+git rebase --onto temp temp master

Fixing that, the thing still doesn't work, and the reason is that git clean -f -X only removes files that are actually ignored. Any file that is already in the index is tracked, and a tracked file is never treated as ignored, even if it matches a .gitignore pattern.
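The behavior is easy to reproduce outside filter-branch. In this sketch (a throwaway repository with placeholder files a and b), both names are listed in .gitignore, but only the untracked one is removed by git clean -X.

```shell
#!/bin/bash
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo tracked   > a
echo untracked > b
git add a
git commit -q -m "Track a"

# Both names now match an ignore pattern.
printf 'a\nb\n' > .gitignore

# -X removes only ignored files, and a tracked file never counts as ignored.
git clean -f -q -X

ls
```

After the command, a is still present because it is in the index, while b is gone.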

The trick is to empty out the index. However, this does too much: git clean then never descends into subdirectories—so the trick comes in two parts: empty out the index, then re-fill it with non-ignored files. Now git clean -f -X will remove the remaining files:

-git filter-branch --tree-filter 'git clean -f -X' -- --all
+git filter-branch --tree-filter 'git rm --cached -qrf . && git add . && git clean -fqX' -- --all

(I added several "quiet" flags here).

To avoid needing to rebase in the first place to install initial .gitignore files, let's say you have a master set of .gitignore files you want in every commit (which we'll then use in the tree filter as well). Simply place these, and nothing else, in a temporary tree:

mkdir /tmp/ignores-to-add
cp .gitignore /tmp/ignores-to-add
mkdir /tmp/ignores-to-add/main
cp main/.gitignore /tmp/ignores-to-add/main
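For a tree with more than a couple of .gitignore files, a small loop can do this copying. The following is only a sketch: src and dest are placeholder directories standing in for the work tree and /tmp/ignores-to-add, and it assumes no path contains a newline.

```shell
#!/bin/bash
set -e

src=$(mktemp -d)      # stands in for the work tree with the desired .gitignore files
dest=$(mktemp -d)     # stands in for /tmp/ignores-to-add

# Sample layout for the demo.
mkdir -p "$src/main" "$src/.git"
echo a     > "$src/.gitignore"
echo x     > "$src/main/.gitignore"
echo other > "$src/main/file"

cd "$src"
# Skip the .git directory, find every .gitignore, and recreate its
# directory under $dest before copying it there.
find . -path ./.git -prune -o -type f -name .gitignore -print |
while IFS= read -r path; do
    mkdir -p "$dest/$(dirname "$path")"
    cp "$path" "$dest/$path"
done

find "$dest" -type f | sort
```

Only the .gitignore files end up under the destination, each at the same relative path it had in the source tree.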

(I'll leave working up a script that finds and copies just .gitignore files to you; it seems moderately annoying to do without one.) Then, for the --tree-filter, use:

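cp -R /tmp/ignores-to-add/. . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX

The first step, cp -R (which can be done anywhere before the git add ., really), installs the correct .gitignore files. The trailing /. makes cp copy the contents of /tmp/ignores-to-add into the temporary tree rather than creating an ignores-to-add subdirectory there. Since we do this to each commit, we never need to rebase before running filter-branch.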

The second removes everything from the index. (A slightly faster method is just rm $GIT_INDEX_FILE but it's not guaranteed that this will work forever.)

The third re-adds ., i.e., everything in the temporary tree. Since the .gitignore files are in place, we only add non-ignored files.

The last step, git clean -qfX, removes work-tree files that are ignored, so that filter-branch won't put them back.
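The four steps can be tried end to end on a small repository modeled on the question's layout. This is a sketch under stated assumptions, not the modified script mentioned at the top of this answer: the IGNORES variable and the temporary directories are invented for the demo, and the source is written as "$IGNORES"/. so that cp copies the directory's contents into the temporary tree.

```shell
#!/bin/bash
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
ignores=$(mktemp -d)               # stand-in for /tmp/ignores-to-add
mkdir "$ignores/main"
printf 'a\nb\n'  > "$ignores/.gitignore"
printf 'x\n!b\n' > "$ignores/main/.gitignore"
export IGNORES="$ignores"          # hypothetical variable name

cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir main
touch a b c x main/a main/x main/y main/z
git add .
git commit -q -m "Initial Commit"
touch d e f y z main/b main/c
echo "Test" | tee a b c x main/a > main/x
git add .
git commit -q -m "Some edits"

# Install the .gitignore files, rebuild the index from non-ignored
# files only, then delete the ignored leftovers from the temporary tree.
git filter-branch --tree-filter '
    cp -R "$IGNORES"/. . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX
' -- --all

git ls-tree -r --name-only HEAD
```

With these patterns, a, b, main/a, and main/x are gone from every rewritten commit, while main/b survives because main/.gitignore re-includes it with !b.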



Answer 2:

On Windows this sequence did not work for me:

cp -R /tmp/ignores-to-add . &&
git rm --cached -qrf . &&
git add . &&
git clean -fqX

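But the following works.

Update every commit using the existing .gitignore (newer Git versions require -c or -o together with ls-files -i, so -c is spelled out here):

git filter-branch --index-filter '
  git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all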
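Before rewriting anything, it can help to preview which tracked paths the ls-files part of that pipeline selects. A minimal sketch, using a throwaway repository and a separate pattern file (both placeholders); -c is given explicitly because newer Git versions insist on -c or -o alongside -i.

```shell
#!/bin/bash
set -e

repo=$(mktemp -d)
rules=$(mktemp)                 # placeholder pattern file
cd "$repo"
git init -q
mkdir main
touch a c main/a main/b
git add .

# A pattern without a slash matches at every directory level.
printf 'a\n' > "$rules"

# List tracked (cached) files that match the ignore patterns.
matches=$(git ls-files -c -i --exclude-from="$rules")
echo "$matches"
# prints:
# a
# main/a
```

These are the paths that xargs would pass to git rm --cached in the index filter.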

Update .gitignore in every commit and filter the files:

cp ../.gitignore /d/tmp-gitignore
git filter-branch --index-filter '
  cp /d/tmp-gitignore ./.gitignore
  git add .gitignore
  git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all
rm /d/tmp-gitignore

Use grep -v if you have special cases, for example a file named empty that exists only to keep an otherwise empty directory:

git ls-files -c -i --exclude-from=.gitignore | grep -vE "empty$" | xargs git rm --cached -q