How to find out which files take up the most space

Posted 2020-05-20 08:09

I need to make the repo smaller. I think I can do that by removing problematic binary files from the git history:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch BigFile' -- --all

And then pruning the now-unreachable objects:

rm -rf .git/refs/original/             # drop the backup refs filter-branch leaves behind
git reflog expire --expire=now --all   # expire every reflog entry immediately
git gc --aggressive --prune=now        # repack and prune unreachable objects now

(Feel free to comment if those commands are wrong.)

The problem: how do I identify those big files so that I can assess whether to remove them from git history? Most likely they are not in the working tree anymore; they have been deleted and probably also untracked with:

git rm --cached BigFile
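
Note that even after such a deletion the file's blobs remain reachable from history. As a quick sanity check (using the BigFile placeholder from above), you can confirm that a path still appears in past commits on any branch:

git log --all --oneline -- BigFile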

Tags: git
4 Answers
Fickle 薄情
#2 · 2020-05-20 08:41

In my answer here I posted a script that will tell you the largest objects, files, or directories. Without arguments, it reports the size of every object, sorted by size. You can pass it --sum or --directories to sum all the objects for each file and print that, or to do the same for all files in each directory. I hope it's useful!
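
The script itself is behind that link and isn't reproduced here. As a rough sketch of what a --sum mode could look like (not the actual script, and assuming paths contain no spaces), you can total every historical blob version per path:

git rev-list --all --objects |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sum[$3] += $2 } END { for (p in sum) print sum[p], p }' |
    sort -nr |
    head -n 20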

女痞
#3 · 2020-05-20 08:46

Couldn't help optimizing MatrixManAtYrService's answer:

git rev-list --all --objects | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' | grep blob | sort -k3nr | head -n 20

This way git rev-list is called only once (rather than once per object displayed), and the script is clearer.
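
One caveat: grep blob will also match any path that happens to contain the string "blob". A stricter variant of the same pipeline filters on the type field instead:

git rev-list --all --objects |
    git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' |
    awk '$2 == "blob"' |
    sort -k3nr |
    head -n 20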

倾城 Initia
#4 · 2020-05-20 08:50

You can find the hash IDs of the largest objects like this:

git rev-list --all --objects | awk '{print $1}' | git cat-file --batch-check | sort -k3nr

Then, for a particular SHA, you can do this to get the file name:

git rev-list --all --objects | grep <SHA>

Not sure if there's a more efficient way to do it. If you know for sure that everything is in pack files (not loose objects), git verify-pack -v produces output that includes the size, and I seem to remember seeing a script somewhere that would parse that output and match each object back up with the original files.
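
For the pack-file route, a minimal sketch (assuming everything has been packed, e.g. by git gc, and SHA-1 object names):

git verify-pack -v .git/objects/pack/pack-*.idx |
    grep -E '^[0-9a-f]{40}' |   # keep only the per-object lines
    sort -k3nr |                # the third column is the uncompressed size
    head -n 10

You would still need git rev-list --all --objects (as above) to map each hash back to a path.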

够拽才男人
#5 · 2020-05-20 08:53

twalberg's answer does the trick. I wrapped it up in a loop so that you can list files in order by size:

while read -r largefile; do
    echo "$largefile" | awk '{printf "%s %s ", $1, $3 ; system("git rev-list --all --objects | grep " $1 " | cut -d \" \" -f 2-")}'
done <<< "$(git rev-list --all --objects | awk '{print $1}' | git cat-file --batch-check | sort -k3nr | head -n 20)"

head -n 20 restricts the output to the top 20. Change as necessary.
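
The sizes are raw byte counts; if GNU coreutils is available, numfmt can render them human-readable. Applied to the inner listing, for example:

git rev-list --all --objects |
    awk '{print $1}' |
    git cat-file --batch-check |
    sort -k3nr |
    head -n 20 |
    numfmt --field=3 --to=iec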

Once you've identified the problem files, check out this answer for how to remove them.
