Git disk usage per branch

Posted 2020-06-03 06:12

Question:

Do you know if there is a way to list the space usage of a Git repository per branch (like df or du would)?

By "the space usage" of a branch I mean "the space used by the commits which are not yet shared across other branches of the repository".

Answer 1:

This doesn’t have a proper answer. If you look at the commits contained only in a specific branch, you would get a list of blobs (basically file versions). Now you would have to check whether these blobs are part of any of the commits in the other branches. After doing that you will have a list of blobs that are only part of your branch.
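The filtering described above can be approximated with git rev-list, which accepts --objects and --not. Here is a minimal Ruby sketch, assuming it runs inside the repository; the helper names (unique_objects_command, parse_rev_list_objects, unique_objects) are made up for illustration. Note that rev-list only excludes objects it sees while walking the excluded branches, so a blob that reappears from older shared history may still be listed.

```ruby
require 'open3'

# Arguments for: objects reachable from `branch` but not from `other_branches`.
def unique_objects_command(branch, other_branches)
  ['git', 'rev-list', '--objects', branch, '--not', *other_branches]
end

# Parse `git rev-list --objects` output into [sha, path] pairs.
# Commit lines carry no path, so their path is nil.
def parse_rev_list_objects(output)
  output.each_line.map do |line|
    sha, path = line.chomp.split(' ', 2)
    [sha, path]
  end
end

# Hypothetical helper: run the command in the current repository.
def unique_objects(branch, other_branches)
  out, status = Open3.capture2(*unique_objects_command(branch, other_branches))
  raise 'git rev-list failed' unless status.success?
  parse_rev_list_objects(out)
end

# Usage (inside a repository, branch names are examples):
#   unique_objects('topic', ['master', 'develop']).each { |sha, path| puts "#{sha} #{path}" }
```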

Now you could sum up the sizes of these blobs to get a result, but that would probably be very wrong. Git delta-compresses blobs against each other in packfiles, so the actual on-disk size of a blob depends on what other blobs are in your repo. You could remove 1000 blobs of 10 MB each and only free 1 KB of disk space.
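To see how far the raw and on-disk sizes diverge, git cat-file --batch-check can print both through its %(objectsize), %(objectsize:disk) and %(deltabase) placeholders. The sketch below only parses that output; ObjectInfo, parse_batch_check and size_summary are names invented for this example, and the numbers in any real repository will differ.

```ruby
# One record per object: raw size, size on disk, and delta base (nil if not a delta).
ObjectInfo = Struct.new(:sha, :type, :size, :disk_size, :delta_base)

BATCH_FORMAT = '%(objectname) %(objecttype) %(objectsize) %(objectsize:disk) %(deltabase)'

# Parse the output of:
#   git cat-file --batch-all-objects --batch-check='<BATCH_FORMAT>'
def parse_batch_check(output)
  output.each_line.map do |line|
    sha, type, size, disk, base = line.split
    # Git prints an all-zero ID when the object is not stored as a delta.
    base = nil if base.nil? || base.match?(/\A0+\z/)
    ObjectInfo.new(sha, type, size.to_i, disk.to_i, base)
  end
end

# Compare what a naive sum reports with what the objects occupy on disk.
def size_summary(objects)
  { raw: objects.sum(&:size), on_disk: objects.sum(&:disk_size) }
end
```

A blob stored as a delta against a similar blob can have a raw size in the megabytes and an on-disk size of a few hundred bytes, which is why summing raw sizes overestimates.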

Usually a big repo size is caused by single big files in the repo (if not, you are probably doing something wrong :). For how to find those, see the question "Find files in git repo over x megabytes, that don't exist in HEAD".
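One common way to hunt for those big files is to pipe git rev-list --objects --all into git cat-file --batch-check and sort by size. The Ruby sketch below handles only the sorting step; largest_blobs is a hypothetical helper, and the sizes it ranks are uncompressed sizes, not on-disk sizes.

```ruby
# Expects the output of:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
# Returns the `limit` largest blobs as [size, sha, path], biggest first.
def largest_blobs(batch_check_output, limit = 10)
  blobs = batch_check_output.each_line.map do |line|
    type, sha, size, path = line.chomp.split(' ', 4)
    [size.to_i, sha, path] if type == 'blob'
  end.compact
  blobs.sort_by { |size, _sha, _path| -size }.first(limit)
end

# Usage (inside a repository):
#   fmt = '%(objecttype) %(objectname) %(objectsize) %(rest)'
#   out = `git rev-list --objects --all | git cat-file --batch-check='#{fmt}'`
#   largest_blobs(out, 5).each { |size, sha, path| puts "#{size}\t#{sha}\t#{path}" }
```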



Answer 2:

Git maintains a directed acyclic graph of commits, with (in a simplistic sense) each commit using up disk space.

Unless all of your branches diverge from the very first commit, there will be commits that are common to several branches, which means that those branches 'share' some amount of disk space.

This makes it difficult to provide a 'per branch' figure of disk usage, as it would need to be qualified with what amount is shared, and with which other branches it is shared.
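To make the sharing concrete, here is a toy Ruby model, not tied to any real repository: each branch is represented by the set of commits reachable from its tip, and a branch's exclusive commits are whatever remains after subtracting every other branch's set. The branch and commit names are invented, and exclusive_commits is a hypothetical helper.

```ruby
require 'set'

# reachable: { branch name => Set of commits reachable from that branch's tip }
# Returns, per branch, the commits that no other branch can reach.
def exclusive_commits(reachable)
  reachable.each_with_object({}) do |(branch, commits), result|
    others = reachable.reject { |name, _| name == branch }
                      .values
                      .reduce(Set.new) { |acc, set| acc | set }
    result[branch] = commits - others
  end
end

history = {
  'master'  => Set['A', 'B', 'C'],
  'feature' => Set['A', 'B', 'D', 'E'],
  'hotfix'  => Set['A', 'B', 'C', 'F']
}

exclusive_commits(history).each do |branch, commits|
  puts "#{branch}: #{commits.to_a.join(', ')}"
end
```

In this model master owns nothing exclusively, because hotfix also reaches A, B and C. Any per-branch figure therefore depends on which other branches happen to exist.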



Answer 3:

Most of the space of your repository is taken by the blobs containing the files.

But when a blob is shared by two branches (or by two files with the same content), it is not duplicated. The size of the repository can't be thought of as the sum of the sizes of the branches. There is no such concept as the space taken by a branch.
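The reason identical content is stored once is that a blob's ID is a hash of its content. As a small Ruby sketch, assuming the default SHA-1 object format: the ID is the SHA-1 of the header "blob <size>\0" followed by the bytes of the file. git_blob_id is a name made up for this example; git hash-object computes the same value.

```ruby
require 'digest/sha1'

# Compute the ID Git would assign to a blob with this content
# (assumes a repository using the SHA-1 object format).
def git_blob_id(content)
  bytes = content.b
  Digest::SHA1.hexdigest("blob #{bytes.bytesize}\0".b + bytes)
end

# Two files with the same content map to the same object, whatever their
# path or branch, so the content is stored only once.
readme_on_master  = "hello\n"
readme_on_feature = "hello\n"
puts git_blob_id(readme_on_master) == git_blob_id(readme_on_feature)
```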

Git also compresses heavily, so small modifications to a file take very little extra space.

Usually, deleting a branch frees only a small and unpredictable amount of space.



Answer 4:

As it seems that nothing like that already exists, here is a Ruby script I wrote for it.

#!/usr/bin/env ruby
require 'set'

display_branches = ARGV

packed_blobs = {}

class PackedBlob
    attr_accessor :sha, :type, :size, :packed_size, :offset, :depth, :base_sha, :is_shared, :branch
    def initialize(sha, type, size, packed_size, offset, depth, base_sha)
        @sha = sha
        @type = type
        @size = size
        @packed_size = packed_size
        @offset = offset
        @depth = depth
        @base_sha = base_sha
        @is_shared = false
        @branch = nil
    end
end

class Branch
    attr_accessor :name, :blobs, :non_shared_size, :non_shared_packed_size, :shared_size, :shared_packed_size, :non_shared_dependable_size, :non_shared_dependable_packed_size
    def initialize(name)
        @name = name
        @blobs = Set.new
        @non_shared_size = 0
        @non_shared_packed_size = 0
        @shared_size = 0
        @shared_packed_size = 0
        @non_shared_dependable_size = 0
        @non_shared_dependable_packed_size = 0
    end
end

dependable_blob_shas = Set.new

# Collect information about every packed object
for pack_idx in Dir[".git/objects/pack/pack-*.idx"]
    IO.popen("git verify-pack -v #{pack_idx}", 'r') do |pack_list|
        pack_list.each_line do |pack_line|
            pack_line.chomp!
            if not pack_line.include? "delta"
                sha, type, size, packed_size, offset, depth, base_sha = pack_line.split(/\s+/, 7)
                size = size.to_i
                packed_size = packed_size.to_i
                packed_blobs[sha] = PackedBlob.new(sha, type, size, packed_size, offset, depth, base_sha)
                dependable_blob_shas.add(base_sha) if base_sha != nil
            else
                break
            end
        end
    end
end

branches = {}

# Now check all blobs of every branch in order to determine whether each is shared between branches or not
IO.popen("git branch --list", 'r') do |branch_list|
    branch_list.each_line do |branch_line|
        # For each branch
        branch_name = branch_line[2..-1].chomp
        branch = Branch.new(branch_name)
        branches[branch_name] = branch
        IO.popen("git rev-list #{branch_name}", 'r') do |rev_list|
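            rev_list.each_line do |commit|
                # Strip the trailing newline before using the SHA in a command
                commit.chomp!
                # Look into each commit in order to collect all the blobs used
                for object in `git ls-tree -zrl #{commit}`.split("\0")
                    bits, type, sha, size, path = object.split(/\s+/, 5)
                    if type == 'blob'
                        blob = packed_blobs[sha]
                        # Loose (not yet packed) objects are not listed by verify-pack;
                        # skip them (running `git gc` first packs them)
                        next if blob.nil?
                        branch.blobs.add(blob)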
                        if not blob.is_shared
                            if blob.branch != nil and blob.branch != branch
                                # this blob has been used in another branch, let's set it to "shared"
                                blob.is_shared = true
                                blob.branch = nil
                            else
                                blob.branch = branch
                            end
                        end
                    end
                end
            end
        end
    end
end

# Now iterate on each branch to compute the space usage for each
branches.each_value do |branch|
    branch.blobs.each do |blob|
        if blob.is_shared
            branch.shared_size += blob.size
            branch.shared_packed_size += blob.packed_size
        else
            if dependable_blob_shas.include?(blob.sha)
                branch.non_shared_dependable_size += blob.size
                branch.non_shared_dependable_packed_size += blob.packed_size
            else
                branch.non_shared_size += blob.size
                branch.non_shared_packed_size += blob.packed_size
            end
        end
    end
    # Now print it if wanted
    if display_branches.empty? or display_branches.include?(branch.name)
        puts "branch: %s" % branch.name
        puts "\tnon shared:"
        puts "\t\tpacked: %s" % branch.non_shared_packed_size
        puts "\t\tnon packed: %s" % branch.non_shared_size
        puts "\tnon shared but with dependencies on it:"
        puts "\t\tpacked: %s" % branch.non_shared_dependable_packed_size
        puts "\t\tnon packed: %s" % branch.non_shared_dependable_size
        puts "\tshared:"
        puts "\t\tpacked: %s" % branch.shared_packed_size
        puts "\t\tnon packed: %s" % branch.shared_size, ""
    end
end

With that one I was able to see that in my 2 MB Git repository, I had one useless branch holding 1 MB of blobs not shared with any other branch.



Answer 5:

I had the same problem this morning and wrote a quick script:

for a in $(git branch -a | grep remotes | awk '{print $1}' | sed 's/remotes\/origin\///'); do echo -n "${a} - "; git clean -d -x -f > /dev/null 2>&1; git checkout "${a}" > /dev/null 2>&1; du -hs --exclude=.git .; done

This cleans the working tree so that each checkout starts from a clean state, checks out every remote branch in turn, and prints the size of the working tree without the .git directory. The du call uses the GNU --exclude option; on BSD/macOS du, -I .git serves the same purpose.

With this, I was able to find the person who pushed a branch with a big file in it.

Please remember to do this in a separate clone of the repository, as git clean -d -x -f will wipe out everything that is not committed.
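If checking out each branch is too risky, a similar figure can be obtained without touching the working tree, since git ls-tree -r -l prints the size of every file in a branch's tip. The Ruby sketch below sums that output; tree_size is a hypothetical helper, and the result is the uncompressed size of the files, not the space the branch takes in .git.

```ruby
# Sum the blob sizes in the output of `git ls-tree -r -l <branch>`.
# Each line looks like: "<mode> <type> <sha> <size>\t<path>".
# Submodule entries (type "commit") show "-" as size and are counted as 0.
def tree_size(ls_tree_output)
  ls_tree_output.each_line.sum do |line|
    meta, _path = line.chomp.split("\t", 2)
    _mode, type, _sha, size = meta.split
    type == 'blob' ? size.to_i : 0
  end
end

# Usage (inside a repository, branch name is an example):
#   puts tree_size(`git ls-tree -r -l origin/some-branch`)
```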



Tags: git diskspace