To be clear on this question, I am not asking about how to remove a single file from history, like this question: Completely remove file from all Git repository commit history. I am also not asking about untracking files from gitignore, like in this question: Ignore files that have already been committed to a Git repository.
I am talking about "updating a .gitignore file, and subsequently removing everything matching the list from history", more or less like this question: Ignore files that have already been committed to a Git repository. However, unfortunately, the answer from that question does not work for this purpose, so I am here to try elaborating the question and hopefully find a good answer that does not involve a human looking through an entire source tree to manually do a filter-branch on each matched file.
Here is a test script that currently performs the procedure from the answer to Ignore files that have already been committed to a Git repository. It removes and recreates a folder named root under PWD, so be careful before running it. I will describe my goal after the code.
#!/bin/bash -e
TESTROOT=${PWD}
GREEN="\e[32m"
RESET="\e[39m"
rm -rf root
mkdir -v root
pushd root
mkdir -v repo
pushd repo
git init
touch a b c x
mkdir -v main
touch main/{a,x,y,z}
# Initial commit
git add .
git commit -m "Initial Commit"
echo -e "${GREEN}Contents of first commit${RESET}"
git ls-files | tee ../00-Initial.txt
# Add another commit just for demo
touch d e f y z main/{b,c}
## Make some other changes
echo "Test" | tee a | tee b | tee c | tee x | tee main/a > main/x
git add .
git commit -m "Some edits"
echo -e "${GREEN}Contents of second commit${RESET}"
git ls-files | tee ../01-Changed.txt
# Now I want to ignore all 'a' and 'b', and all 'main/x', but not 'main/b'
## Checkout the root commit
git checkout -b temp $(git rev-list HEAD | tail -1)
## Add .gitignores
echo "a" >> .gitignore
echo "b" >> .gitignore
echo "x" >> main/.gitignore
echo "!b" >> main/.gitignore
git add .
git commit --amend -m "Initial Commit (2)"
## --v Not sure if it is correct
git rebase --onto temp master
git checkout master
## --v Now, why should I delete this branch?
git branch -D temp
echo -e "${GREEN}Contents after rebase${RESET}"
git ls-files | tee ../02-Rebased.txt
# Supposedly, rewrite history
git filter-branch --tree-filter 'git clean -f -X' -- --all
echo -e "${GREEN}Contents after filter-branch${RESET}"
git ls-files | tee ../03-Rewritten.txt
echo "History of 'a'"
git log -p a
popd # repo
popd # root
This code creates a repository, adds some files, makes some edits, and performs the cleaning procedure. It also generates some log files. Ideally, I would like a, b, and main/x to disappear from history, while main/b stays. However, right now nothing is removed from history. What should be modified to achieve this goal?
Bonus points if this can be done on multiple branches. But for now, keep it to a single master branch.
Achieving the result you want is a bit tricky. The simplest way, using git filter-branch with a --tree-filter, will be very slow. Edit: I've modified your example script to do this; see the end of this answer.
First, let's note one constraint: you can never change any existing commit. All you can do is make new commits that look a lot like the old ones, but "new and improved". You then direct Git to stop looking at the old commits, and look only at the new ones. This is what we will do here. (Then, if required, you can force Git to really forget the old commits. The easiest way is to re-clone the clone.)
Now, to re-commit every commit that is reachable from one or more branch and/or tag names, preserving everything except that which we explicitly tell it to change,[1] we can use git filter-branch. The filter-branch command has a rather dizzying array of filtering options, most of which are meant to make it go faster, because copying every commit is pretty slow. If there are just a few hundred commits in a repository, with a few dozens or hundreds of files each, it's not so bad; but if there are about 100k commits holding about 100k files each, that's ten thousand million files (10,000,000,000 files) to examine and re-commit. It is going to take a while.
Unfortunately there is no easy and convenient way to speed this up. The best way to speed it up would be to use an --index-filter, but there is no built-in index filter command that will do what you want. The easiest filter to use is --tree-filter, which is also the slowest one there is. You might want to experiment with writing your own index filter, perhaps in shell script or perhaps in another language you prefer (you will still need to invoke git update-index either way).
[1] Signed annotated tags cannot be preserved intact, so their signatures will be stripped. Signed commits may have their signatures become invalid. That happens whenever the commit's hash changes: the hash ID of a commit is the checksum of the commit's contents, so if the set of files changes, the checksum changes; and if the checksum of a parent commit changes, the checksum of this commit also changes.
Using --tree-filter
When you use git filter-branch with --tree-filter, what the filter-branch code does is to extract each commit, one at a time, into a temporary directory. This temporary directory has no .git directory and is not where you are running git filter-branch (it's actually in a subdirectory of the .git directory unless you use the -d option to redirect Git to, say, a memory filesystem, which is a good idea for speeding it up).
After extracting the entire commit into this temporary directory, Git runs your tree-filter. Once your tree-filter finishes, Git packages up everything in that temporary directory into the new commit. Whatever you leave there, is in. Whatever you add to there, is added. Whatever you modify there, is modified. Whatever you remove from there, is no longer in the new commit.
Note that a .gitignore file in this temporary directory has no effect on what will be committed (but the .gitignore file itself will be committed, since whatever is in the temporary directory becomes the new copy-commit). So if you want to be sure that a file of some known path is not committed, simply rm -f known/path/to/file.ext. If the file was in the temporary directory, it is now gone. If not, nothing happens and all is well.
Hence, a workable tree filter would be:
rm -f $(cat /tmp/files-to-remove)
(assuming no white space issues in file names; to avoid white space issues, feed the list to xargs instead, as in xargs -0 rm -f < /tmp/files-to-remove, with whatever encoding you like for the xargs input; -z style encoding, which the -0 option reads, is ideal since \0 is forbidden in path names).
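The white space point above can be made concrete with a small, self-contained sketch. It does not touch any repository: the directory held in work is only a stand-in for the temporary tree that filter-branch creates, and the list file name is a hypothetical replacement for /tmp/files-to-remove. It assumes an xargs that supports -0 (GNU and BSD versions do).

```shell
#!/bin/sh
# Sketch: remove a NUL-delimited list of paths, the way a tree filter might.
set -e
work=$(mktemp -d)      # stand-in for filter-branch's temporary tree (assumption)
list="$work.list"      # hypothetical list file, in place of /tmp/files-to-remove
mkdir -p "$work/main"
touch "$work/a" "$work/keep me" "$work/main/x y" "$work/main/b"

# Each path is terminated by \0, so spaces (and even newlines) are safe.
printf '%s\0' 'a' 'main/x y' 'not-present' > "$list"

# The paths are relative, so run from the top of the tree.
# rm -f stays silent about 'not-present', which is what we want here.
( cd "$work" && xargs -0 rm -f < "$list" )

ls -A "$work" "$work/main"
```

Only the listed paths are removed; "keep me" and main/b survive even though one removed path contains a space.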
Converting this to an index filter
Using an index filter lets Git skip the extract-and-examine phases. If you had a fixed "remove" list in the right form, it would be easy to use.
Let's say you have the file names in /tmp/files-to-remove in a form that is suitable for xargs -0. Your index filter might then read, in its entirety:
xargs -0 git rm --cached -f --ignore-unmatch -- < /tmp/files-to-remove
which is basically the same as the rm -f above, but works within the temporary index Git uses for each commit-to-be-copied. (Add -q to the git rm --cached to make it quiet.)
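One possible way to produce such a list, sketched here under assumptions, is to ask Git which tracked files match the ignore rules in the current checkout. The throwaway repository, the identity settings, and the list file name below are all made up for the demonstration. Note the limitation: the list only covers files present in the current index, so a file that existed only in older commits would have to be added by hand.

```shell
#!/bin/sh
# Sketch: build a NUL-delimited "remove" list from the current index,
# then run the same git rm command an index filter would run.
set -e
repo=$(mktemp -d)      # throwaway repository (assumption)
list="$repo.list"      # hypothetical list file, in place of /tmp/files-to-remove
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"
mkdir main
touch a b c main/b main/x
git add .
git commit -q -m "Initial Commit"

# Ignore rules written after the files were already committed.
printf 'a\nb\n' > .gitignore
printf 'x\n!b\n' > main/.gitignore

# -c: consider tracked files, -i: keep only those matching ignore rules,
# -z: terminate each path with \0 for xargs -0.
git ls-files -ci --exclude-standard -z > "$list"
tr '\0' '\n' < "$list"

# Drop the listed paths from the index only; the work tree is untouched.
xargs -0 git rm --cached -q -f --ignore-unmatch -- < "$list"
git ls-files
```

With GNU xargs, an empty list would still run git rm once with no paths, which fails; its -r (--no-run-if-empty) option avoids that.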
Applying .gitignore files in a tree filter
Your example script tries to use a --tree-filter after rebasing onto an initial commit that has the desired items:
git filter-branch --tree-filter 'git clean -f -X' -- --all
There is one initial bug though (the git rebase is wrong):
-git rebase --onto temp master
+git rebase --onto temp temp master
Fixing that, the thing still doesn't work, and the reason is that git clean -f -X only removes files that are actually ignored. Any file that is already in the index is not actually ignored.
The trick is to empty out the index. However, this does too much: git clean then never descends into subdirectories. So the trick comes in two parts: empty out the index, then re-fill it with non-ignored files. Now git clean -f -X will remove the remaining files:
-git filter-branch --tree-filter 'git clean -f -X' -- --all
+git filter-branch --tree-filter 'git rm --cached -qrf . && git add . && git clean -fqX' -- --all
(I added several "quiet" flags here).
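The effect of emptying and re-filling the index can be seen in a throwaway repository. This is only a sketch: the repository, the identity settings, and the file names are invented for the demonstration, and git clean is run with -n (dry run) so nothing is actually deleted.

```shell
#!/bin/sh
# Sketch: git clean -X skips tracked files until the index is rebuilt.
set -e
repo=$(mktemp -d)      # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"
touch a c
git add .
git commit -q -m "Initial Commit"
echo a > .gitignore

# 'a' is still tracked, so it does not count as ignored.
before=$(git clean -n -X)

# Empty the index, then re-add everything: 'a' now matches .gitignore
# and is skipped by git add, leaving it untracked and ignored.
git rm --cached -qrf .
git add .
after=$(git clean -n -X)

echo "before: [$before]"
echo "after:  [$after]"
```

The first dry run reports nothing to remove; the second one reports the file a.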
To avoid needing to rebase in the first place to install initial .gitignore files, let's say you have a master set of .gitignore files you want in every commit (which we'll then use in the tree filter as well). Simply place these, and nothing else, in a temporary tree:
mkdir /tmp/ignores-to-add
cp .gitignore /tmp/ignores-to-add
mkdir /tmp/ignores-to-add/main
cp main/.gitignore /tmp/ignores-to-add/main
(I'll leave working up a script that finds and copies just .gitignore files to you; it seems moderately annoying to do without one.) Then, for the --tree-filter, use:
cp -R /tmp/ignores-to-add/. . &&
git rm --cached -qrf . &&
git add . &&
git clean -fqX
The first step, cp -R (which can be done anywhere before the git add ., really), installs the correct .gitignore files. Since we do this to each commit, we never need to rebase before running filter-branch.
The second removes everything from the index. (A slightly faster method is just rm $GIT_INDEX_FILE but it's not guaranteed that this will work forever.)
The third re-adds ., i.e., everything in the temporary tree. Since the .gitignore files are in place, we only add non-ignored files.
The last step, git clean -qfX, removes work-tree files that are ignored, so that filter-branch won't put them back.
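The script for finding and copying just the .gitignore files, mentioned above as an exercise, could look roughly like the following sketch. The src and dest directories are temporary stand-ins for the repository's work tree and for /tmp/ignores-to-add, and the loop assumes no path contains a newline.

```shell
#!/bin/sh
# Sketch: copy every .gitignore into a parallel tree, keeping relative paths.
set -e
src=$(mktemp -d)       # stand-in for the repository's work tree (assumption)
dest=$(mktemp -d)      # stand-in for /tmp/ignores-to-add (assumption)
mkdir -p "$src/main" "$src/.git/info"
printf 'a\nb\n' > "$src/.gitignore"
printf 'x\n!b\n' > "$src/main/.gitignore"
touch "$src/main/b" "$src/.git/.gitignore"   # this one must not be copied

cd "$src"
# Prune .git entirely, then print each regular file named .gitignore.
find . -name .git -prune -o -type f -name .gitignore -print |
while IFS= read -r f; do
    mkdir -p "$dest/$(dirname "$f")"   # recreate the containing directory
    cp "$f" "$dest/$f"                 # same relative path under dest
done

find "$dest" -type f | sort
```

Pruning .git keeps Git's own internals out of the copy, and recreating each directory first lets nested files such as main/.gitignore land at the same relative path.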
On Windows, this sequence did not work for me:
cp -R /tmp/ignores-to-add . &&
git rm --cached -qrf . &&
git add . &&
git clean -fqX
But the following works.
Update every commit using the existing .gitignore:
git filter-branch --index-filter '
git ls-files -ci --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all
Update .gitignore in every commit and filter the files:
cp ../.gitignore /d/tmp-gitignore
git filter-branch --index-filter '
cp /d/tmp-gitignore ./.gitignore
git add .gitignore
git ls-files -ci --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all
rm /d/tmp-gitignore
Use grep -v if you have special cases, for example a file named empty that is there to keep an otherwise empty directory:
git ls-files -ci --exclude-from=.gitignore | grep -vE "empty$" | xargs git rm --cached -q