I accidentally dropped a DVD-rip into a website project, then carelessly ran git commit -a -m ..., and, zap, the repo was bloated by 2.2 gigs. Next time I made some edits, deleted the video file, and committed everything, but the compressed file is still there in the repository, in history.
I know I can start branches from those commits and rebase one branch onto another. But what should I do to merge the 2 commits together so that the big file doesn't show up in the history and gets cleaned up by garbage collection?
I basically did what was on this answer: https://stackoverflow.com/a/11032521/1286423
(for history, I'll copy-paste it here)
It didn't work, because I like to rename and move things a lot. So some big files were in folders that had been renamed, and I think gc couldn't delete those files because tree objects still pointed to them. My ultimate solution to really kill it was to:
My repo (the .git) went from 32MB to 388KB, something even filter-branch couldn't achieve.
What you want to do is highly disruptive if you have published history to other developers. See “Recovering From Upstream Rebase” in the git rebase documentation for the necessary steps after repairing your history.
You have at least two options: git filter-branch and an interactive rebase, both explained below.
Using git filter-branch
I had a similar problem with bulky binary test data from a Subversion import and wrote about removing data from a git repository.
Say your git history is:
Note that git lola is a non-standard but highly useful alias. With the --name-status switch, we can see tree modifications associated with each commit.
In the “Careless” commit (whose SHA1 object name is ce36c98) the file oops.iso is the DVD-rip added by accident and removed in the next commit, cb14efd. Using the technique described in the aforementioned blog post, the command to execute is:
Options:
--prune-empty removes commits that become empty (i.e., do not change the tree) as a result of the filter operation. In the typical case, this option produces a cleaner history.
-d names a temporary directory that does not yet exist to use for building the filtered history. If you are running on a modern Linux distribution, specifying a tree in /dev/shm will result in faster execution.
--index-filter is the main event and runs against the index at each step in the history. You want to remove oops.iso wherever it is found, but it isn’t present in all commits. The command git rm --cached -f --ignore-unmatch oops.iso deletes the DVD-rip when it is present and does not fail otherwise.
--tag-name-filter describes how to rewrite tag names. A filter of cat is the identity operation. Your repository, like the sample above, may not have any tags, but I included this option for full generality.
-- specifies the end of options to git filter-branch.
--all following -- is shorthand for all refs. Your repository, like the sample above, may have only one ref (master), but I included this option for full generality.
After some churning, the history is now:
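As a rough sketch, here is the command that the options above add up to, run against a throwaway repository so the effect can be seen end to end. The sample commits mirror the ones named in the text; index.html, admin.html, login.html, the file contents, the identity settings, and the scratch path are assumptions made for illustration.

```shell
#!/bin/sh
set -e
# Skips the start-up warning and delay in newer git versions (ignored by older ones).
export FILTER_BRANCH_SQUELCH_WARNING=1

# Build a throwaway repository that mimics the history described above.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"              # placeholder identity
git config user.email "example@example.com"

echo index > index.html;  git add index.html;  git commit -q -m "Index"
echo admin > admin.html;  git add admin.html;  git commit -q -m "Admin page"
echo other > other.html
echo "pretend this is 2.2 GB" > oops.iso
git add other.html oops.iso;                   git commit -q -m "Careless"
git rm -q oops.iso;                            git commit -q -m "Remove DVD-rip"
echo login > login.html;  git add login.html;  git commit -q -m "Login page"

# Scratch directory for -d: the path itself must not exist yet.
scratch="$(mktemp -d)/filter-scratch"

# The command, assembled from the options listed above.
git filter-branch --prune-empty -d "$scratch" \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all

# Show the rewritten history with per-commit file changes.
git --no-pager log --name-status --pretty=format:'%h %s'
```

On a real repository only the git filter-branch line is needed; trying it on a backup copy first is the safer route.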
Notice that the new “Careless” commit adds only other.html and that the “Remove DVD-rip” commit is no longer on the master branch. The branch labeled refs/original/refs/heads/master contains your original commits in case you made a mistake. To remove it, follow the steps in “Checklist for Shrinking a Repository.”
For a simpler alternative, clone the repository to discard the unwanted bits.
Using a file:///... clone URL copies objects rather than creating hardlinks only.
Now your history is:
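The clone step can be sketched like this, assuming a small stand-in repository (the paths, file names, and the 1 MiB random file playing the DVD-rip are all made up for the example). The point it tries to show is that the filtered repository still stores the old blob, while a clone made through a file:// URL does not.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/site"
cd "$work/site"
git config user.name "Example"              # placeholder identity
git config user.email "example@example.com"

echo index > index.html; git add index.html; git commit -q -m "Index"
echo other > other.html
# 1 MiB of incompressible data stands in for the DVD-rip.
head -c 1048576 /dev/urandom > oops.iso
blob=$(git hash-object oops.iso)            # object name of the big blob
git add other.html oops.iso; git commit -q -m "Careless"
git rm -q oops.iso;          git commit -q -m "Remove DVD-rip"

git filter-branch --prune-empty \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" -- --all

# A file:// URL makes git transfer objects instead of hardlinking them, so
# only objects reachable from branches and tags end up in the clone.
git clone -q "file://$work/site" "$work/site-clean"

git -C "$work/site" cat-file -e "$blob" && echo "filtered repo still stores the blob"
git -C "$work/site-clean" cat-file -e "$blob" 2>/dev/null || echo "clone does not contain the blob"

# Compare the on-disk sizes of the two object stores (in KiB).
du -sk "$work/site/.git" "$work/site-clean/.git"
```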
The SHA1 object names for the first two commits (“Index” and “Admin page”) stayed the same because the filter operation did not modify those commits. “Careless” lost oops.iso and “Login page” got a new parent, so their SHA1s did change.
Interactive rebase
With a history of:
you want to remove oops.iso from “Careless” as though you never added it, and then “Remove DVD-rip” is useless to you. Thus, our plan going into an interactive rebase is to keep “Admin page,” edit “Careless,” and discard “Remove DVD-rip.”
Running $ git rebase -i 5af4522 starts an editor with the following contents.
Executing our plan, we modify it to
That is, we delete the line with “Remove DVD-rip” and change the operation on “Careless” to be edit rather than pick.
Save-quitting the editor drops us at a command prompt with the following message.
As the message tells us, we are on the “Careless” commit we want to edit, so we run two commands.
The first removes the offending file from the index. The second modifies or amends “Careless” to be the updated index and -C HEAD instructs git to reuse the old commit message. Finally, git rebase --continue goes ahead with the rest of the rebase operation.
This gives a history of:
which is what you want.
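The whole rebase can be replayed non-interactively as a sketch. It assumes a throwaway repository with the same five commit messages, that the rebase starts from the “Index” commit (standing in for 5af4522), and a small helper script, todo-editor.sh, that is purely hypothetical and only mimics the manual edit of the todo list.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"              # placeholder identity
git config user.email "example@example.com"

echo index > index.html;  git add index.html;  git commit -q -m "Index"
base=$(git rev-parse HEAD)                  # plays the role of 5af4522
echo admin > admin.html;  git add admin.html;  git commit -q -m "Admin page"
echo other > other.html
echo "pretend this is 2.2 GB" > oops.iso
git add other.html oops.iso;                   git commit -q -m "Careless"
git rm -q oops.iso;                            git commit -q -m "Remove DVD-rip"
echo login > login.html;  git add login.html;  git commit -q -m "Login page"

# Stand-in for editing the todo list by hand: line 1 is "Admin page" (kept),
# line 2 is "Careless" (pick becomes edit), line 3 is "Remove DVD-rip" (deleted).
cat > "$repo/.git/todo-editor.sh" <<'EOF'
#!/bin/sh
sed -e '2s/^pick/edit/' -e '3d' "$1" > "$1.new" && mv "$1.new" "$1"
EOF
chmod +x "$repo/.git/todo-editor.sh"

GIT_SEQUENCE_EDITOR="$repo/.git/todo-editor.sh" git rebase -i "$base"

# The rebase is now stopped at "Careless": run the two commands from the text.
git rm -q --cached oops.iso                 # drop the file from the index only
git commit -q --amend -C HEAD               # amend, reusing the old message
git rebase --continue

git --no-pager log --name-status --pretty=format:'%h %s'
```

Because git rm --cached only touches the index, oops.iso is left behind as an untracked file in the working tree and can be deleted by hand afterwards.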
You can do this using the git filter-branch command:
git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
git filter-branch --tree-filter 'rm -f path/to/file' HEAD worked pretty well for me, although I ran into the same problem as described here, which I solved by following this suggestion.
The pro-git book has an entire chapter on rewriting history - have a look at the filter-branch/Removing a File from Every Commit section.
After trying virtually every answer in SO, I finally found this gem that quickly removed and deleted the large files in my repository and allowed me to sync again: http://www.zyxware.com/articles/4027/how-to-delete-files-permanently-from-your-local-and-remote-git-repositories
cd to your local working folder and run the following command:
Replace FOLDERNAME with the file or folder you wish to remove from the given git repository.
Once this is done, run the following commands to clean up the local repository:
Now push all the changes to the remote repository:
This will clean up the remote repository.
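One common sequence that matches these three steps is sketched below. It is not necessarily what the linked article uses word for word, so treat the exact flags as an assumption. FOLDERNAME is the placeholder from the text; the local bare repository standing in for the remote, and the file names inside it, are made up so the script can run on its own.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway "remote" and working repository (stand-ins for a real setup).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Example"              # placeholder identity
git config user.email "example@example.com"
git remote add origin "$work/remote.git"

mkdir FOLDERNAME
head -c 1048576 /dev/urandom > FOLDERNAME/big.bin
echo readme > README
git add .;                 git commit -q -m "Add everything"
git rm -q -r FOLDERNAME;   git commit -q -m "Remove folder"
git push -q origin HEAD

# Step 1: strip FOLDERNAME from every commit on every ref.
git filter-branch -f --index-filter \
  "git rm -r --cached --ignore-unmatch FOLDERNAME" -- --all

# Step 2: drop the backup refs and reflog entries, then collect garbage.
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --quiet --prune=now

# Step 3: overwrite the remote branches with the rewritten history.
git push -q origin --all --force

git count-objects -vH
```

The force push replaces the remote branches; the remote side only frees the old objects once it runs its own garbage collection, and anyone else with a clone has to re-clone or rebase onto the new history.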
When you run into this problem, git rm will not suffice, as git remembers that the file existed once in our history, and thus will keep a reference to it.
To make things worse, rebasing is not easy either, because any references to the blob will prevent the git garbage collector from cleaning up the space. This includes remote references and reflog references.
I put together git forget-blob, a little script that tries removing all these references, and then uses git filter-branch to rewrite every commit in the branch.
Once your blob is completely unreferenced, git gc will get rid of it.
The usage is pretty simple: git forget-blob file-to-forget. You can get more info here: https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/
I put this together thanks to the answers from Stack Overflow and some blog entries. Credits to them!
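To see why git rm alone is not enough, here is a small sketch using only stock git commands (it does not use git forget-blob, and the throwaway repository and file contents are assumptions). It lists what still references a deleted file and shows that git gc keeps the blob while those references exist.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"              # placeholder identity
git config user.email "example@example.com"

echo "pretend this is huge" > oops.iso
git add oops.iso;   git commit -q -m "Careless"
blob=$(git rev-parse HEAD:oops.iso)         # object name of the file's blob
git rm -q oops.iso; git commit -q -m "Remove DVD-rip"

# Commits on any ref that still touch the path (the add and the delete).
git --no-pager log --all --oneline -- oops.iso

# The blob is still reachable from refs and reflog entries.
git rev-list --objects --all --reflog | grep "$blob"

# Garbage collection therefore keeps it, even with --prune=now.
git gc --quiet --prune=now
git cat-file -e "$blob" && echo "blob survives: history still references it"
```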