Why is my git repository so big?

Posted 2019-01-02 16:31

145M = .git/objects/pack/

I wrote a script that sums the sizes of the diffs between each commit and its parent, walking backwards from the tip of each branch. I get 129MB, and that is without compression and without accounting for identical files across branches or for history shared between branches.
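
For illustration, the sum is roughly equivalent to this sketch (hypothetical; not my exact script):

# sum the byte size of each commit's diff against its parent, across all refs
git rev-list --all | while read commit; do
    git show "$commit" --pretty=format: | wc -c
done | awk '{ total += $1 } END { print total, "bytes" }'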

Git takes all of those things into account, so I would expect a much, much smaller repository. So why is .git so big?

I've done:

git fsck --full
git gc --prune=today --aggressive
git repack

To answer the question of how many files/commits: I have 19 branches with about 40 files in each, and 287 commits, found using:

git log --oneline --all|wc -l

It should not be taking tens of megabytes to store information about this.

Tags: git
12 answers
美炸的是我
Answer 2 · 2019-01-02 17:04

Are you sure you are counting just the .pack files and not the .idx files? They are in the same directory as the .pack files, but do not contain any of the repository data: as the extension indicates, they are nothing more than indexes for the corresponding pack. In fact, if you know the correct command, you can easily recreate them from the pack file, and git itself does so when cloning, since only a pack file is transferred over the native git protocol.
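
Incidentally, rebuilding an .idx from its pack looks like this (a sketch; the pack filename is a placeholder):

# regenerates pack-1234abcd.idx next to the pack file
git index-pack -v .git/objects/pack/pack-1234abcd.pack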

As a representative sample, I took a look at my local clone of the linux-2.6 repository:

$ du -c *.pack
505888  total

$ du -c *.idx
34300   total

That indicates an overhead of around 7% is common.

There are also files outside objects/; in my personal experience, index and gitk.cache tend to be the biggest ones (totaling 11M in my clone of the linux-2.6 repository).

浮光初槿花落
Answer 3 · 2019-01-02 17:06

Some scripts I use:

git-fatfiles

git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
    cut -f1 -d' ' | \
    git cat-file --batch-check | \
    grep blob | \
    sort -n -k 3 | \
    tail -n40 | \
    while read hash type size; do 
         echo -n "-e s/$hash/$size/p ";
    done) | \
    sort -n -k1
...
89076 images/screenshots/properties.png
103472 images/screenshots/signals.png
9434202 video/parasite-intro.avi

If you want more lines, see also the Perl version in a neighbouring answer: https://stackoverflow.com/a/45366030/266720

git-eradicate (for video/parasite-intro.avi):

git filter-branch -f  --index-filter \
    'git rm --force --cached --ignore-unmatch video/parasite-intro.avi' \
     -- --all
rm -Rf .git/refs/original && \
    git reflog expire --expire=now --all && \
    git gc --aggressive && \
    git prune

Note: the second script is designed to remove info from Git completely (including all info from reflogs). Use with caution.

孤独总比滥情好
Answer 4 · 2019-01-02 17:08

If you want to find what files are taking up space in your git repository, run

git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5

Then extract the blob reference that takes up the most space (the last line), and check which filename is taking up so much space:

git rev-list --objects --all | grep <reference>
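
Putting the two steps together, a rough sketch (assuming a POSIX shell) that prints the paths of the five largest blobs:

git verify-pack -v .git/objects/pack/*.idx \
    | grep blob | sort -k 3 -n | tail -5 \
    | while read hash type size rest; do
          # map each blob hash back to a path in history
          git rev-list --objects --all | grep "$hash"
      done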

This might even be a file that you removed with git rm, but git remembers it because there are still references to it, such as tags, remote-tracking branches, and the reflog.

Once you know which file you want to get rid of, I recommend using git forget-blob:

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

It is easy to use; just do

git forget-blob file-to-forget

This will remove every reference from git, remove the blob from every commit in history, and run garbage collection to free up the space.

看淡一切
Answer 5 · 2019-01-02 17:09

Just FYI, the biggest reason why you may end up with unwanted objects being kept around is that git maintains a reflog.

The reflog is there to save your butt when you accidentally delete your master branch or somehow otherwise catastrophically damage your repository.

The easiest way to fix this is to truncate your reflogs before compressing (just make sure that you never want to go back to any of the commits in the reflog).
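
A minimal sketch of the truncation step (use with caution; it expires every reflog entry immediately):

git reflog expire --expire=now --all

Then compress: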

git gc --prune=now --aggressive
git repack

Combined with the reflog expiry above, this is different from a plain git gc --prune=today in that the entire reflog is expired and all unreachable objects are pruned immediately.

柔情千种
Answer 6 · 2019-01-02 17:10

The git-fatfiles script from Vi's answer is lovely if you want to see the size of all your blobs, but it is so slow as to be unusable. I removed the 40-line output limit, and it tried to use all of my computer's RAM instead of finishing. So I rewrote it: this version is thousands of times faster, adds optional features, and fixes a strange bug (the old version would give inaccurate counts if you summed the output to see the total space used by a file).

#!/usr/bin/perl
use warnings;
use strict;
use IPC::Open2;
use v5.14;

# Try to get the "format_bytes" function:
my $canFormat = eval {
    require Number::Bytes::Human;
    Number::Bytes::Human->import('format_bytes');
    1;
};
my $format_bytes;
if ($canFormat) {
    $format_bytes = \&format_bytes;
}
else {
    $format_bytes = sub { return shift; };
}

# parse arguments:
my ($directories, $sum);
{
    my $arg = $ARGV[0] // "";
    if ($arg eq "--sum" || $arg eq "-s") {
        $sum = 1;
    }
    elsif ($arg eq "--directories" || $arg eq "-d") {
        $directories = 1;
        $sum = 1;
    }
    elsif ($arg) {
        print "Usage: $0 [ --sum, -s | --directories, -d ]\n";
        exit 1;
    } 
}

# the format is [hash, file]
my %revList = map { (split(' ', $_))[0, 1] } qx(git rev-list --all --objects);
my $pid = open2(my $childOut, my $childIn, "git cat-file --batch-check");

# The format is (hash => size)
my %hashSizes = map {
    print $childIn $_ . "\n";
    my @blobData = split(' ', <$childOut>);
    if ($blobData[1] eq 'blob') {
        # [hash, size]
        $blobData[0] => $blobData[2];
    }
    else {
        ();
    }
} keys %revList;
close($childIn);
waitpid($pid, 0);

# Need to filter because some aren't files--there are useless directories in this list.
# Format is name => size.
my %fileSizes =
    map { exists($hashSizes{$_}) ? ($revList{$_} => $hashSizes{$_}) : () } keys %revList;


my @sortedSizes;
if ($sum) {
    my %fileSizeSums;
    if ($directories) {
        while (my ($name, $size) = each %fileSizes) {
            # strip off the trailing part of the filename:
            $fileSizeSums{$name =~ s|/[^/]*$||r} += $size;
        }
    }
    else {
        while (my ($name, $size) = each %fileSizes) {
            $fileSizeSums{$name} += $size;
        }
    }

    @sortedSizes = map { [$_, $fileSizeSums{$_}] }
        sort { $fileSizeSums{$a} <=> $fileSizeSums{$b} } keys %fileSizeSums;
}
else {
    # Print the space taken by each file/blob, sorted by size
    @sortedSizes = map { [$_, $fileSizes{$_}] }
        sort { $fileSizes{$a} <=> $fileSizes{$b} } keys %fileSizes;

}

for my $fileSize (@sortedSizes) {
    printf "%s\t%s\n", $format_bytes->($fileSize->[1]), $fileSize->[0];
}

Name this git-fatfiles.pl and run it. To see the disk space used by all revisions of a file, use the --sum option. To see the same thing grouped by directory, use the --directories option. If you install the Number::Bytes::Human CPAN module (run "cpan Number::Bytes::Human"), the sizes will be human-readable: "21M /path/to/file.mp4".
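
A hypothetical invocation, run from inside the repository you want to inspect:

chmod +x git-fatfiles.pl
./git-fatfiles.pl --sum | tail -20    # the twenty biggest files by total size across history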

闭嘴吧你
Answer 7 · 2019-01-02 17:14

Before running git filter-branch and git gc, you should review the tags that are present in your repo. Any real system with automatic tagging for things like continuous integration and deployment will leave unwanted objects still referenced by those tags; gc can't remove them, and you will keep wondering why the repo is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repo. The new bare repo will have the cleaned-up history.
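
For example (the tag name and paths here are hypothetical placeholders):

# review tags and delete any that pin unwanted history
git tag
git tag -d ci-build-1234

# after filtering and gc, push to a fresh bare repository
git init --bare ../clean-repo.git
git push ../clean-repo.git master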
