I am trying to extract (source code line, author label) pairs from Git repositories. The easiest way to do that is with git blame. The problem is that git blame takes the last committer as the author, no matter whether that committer just re-indented the code or really changed it. Do you know a better way to do it?
Or maybe, before trying to solve the problem, I should first check how many source lines are associated with multiple authors. If the percentage is small, there is no need to worry about it. But I find that even counting them is difficult. For a commit with a single parent, how can we know that the commit changed a line rather than deleted a line and added a line? For a commit with two parents (like a merge), how should I combine the diff results from the two branches?
Thanks
Overview
This is a fundamental misunderstanding of how Git works. Git does not commit patches or diffs; it commits trees and blobs, although packfiles certainly do some sort of deltification. Most of the commit history is calculated at run-time with some flavor of diff.
In other words, if your diff tools can do what you want, so can Git.
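To make that concrete, here is a minimal sketch of how a diff tool decides between "changed" and "deleted plus added". It uses Python's standard difflib rather than Git's own diff machinery, so the grouping can differ from what Git reports; the function name classify_changes and the "whitespace-only" label are made up for illustration.

```python
import difflib


def classify_changes(old_lines, new_lines):
    """Label each differing region between two versions of a file.

    Returns (tag, old_chunk, new_chunk) tuples, where tag is one of
    'whitespace-only', 'replace', 'delete' or 'insert'.
    """
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_chunk, new_chunk = old_lines[i1:i2], new_lines[j1:j2]
        # A 'replace' whose lines carry the same whitespace-separated tokens
        # is only a re-indentation or re-spacing, not a real edit.
        if tag == "replace" and (
            [line.split() for line in old_chunk]
            == [line.split() for line in new_chunk]
        ):
            tag = "whitespace-only"
        result.append((tag, old_chunk, new_chunk))
    return result


old = ["def f(x):", "return x + 1", "y = f(1)", "print('done')"]
new = ["def f(x):", "    return x + 1", "y = f(1)", "print('finished')"]
changes = classify_changes(old, new)
for tag, before, after in changes:
    print(tag, before, after)
```

Whether a region counts as one "replace" or as a separate "delete" and "insert" is purely a decision of the diff algorithm, which is why the answer to "did this commit change the line?" depends on the tool you ask.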
git-blame
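The git-blame command won't do what you want, because the man page says (emphasis mine):
Annotates each *line* in the given file with information from the revision which last modified the *line*.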
In other words, it's strictly line-oriented.
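That said, git blame does have options that soften the indentation problem: -w ignores whitespace when comparing versions, and -M and -C follow lines that were moved or copied. The attribution is still per line, but a pure re-indent no longer steals the credit. A rough sketch for collecting (line, author) pairs from the --line-porcelain output follows; the helper names parse_line_porcelain and blame_pairs are my own, and I am assuming author names and file contents decode as text.

```python
import subprocess


def parse_line_porcelain(text):
    """Turn `git blame --line-porcelain` output into (line, author) pairs."""
    pairs = []
    author = None
    for line in text.split("\n"):
        if line.startswith("author "):
            # Header line naming the author of the commit blamed for the
            # source line that follows.
            author = line[len("author "):]
        elif line.startswith("\t"):
            # Source lines are the only lines prefixed with a tab.
            pairs.append((line[1:], author))
    return pairs


def blame_pairs(path, repo="."):
    """Run git blame on one file, ignoring whitespace-only changes (-w)
    and following moved or copied lines (-M, -C)."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "-w", "-M", "-C",
         "--line-porcelain", "--", path],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    return parse_line_porcelain(out)
```

Something like blame_pairs("src/main.py") (a hypothetical path) would then give one pair per line of the file as it exists in the working tree's HEAD.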
git-log
You can get close to what you want with git-log. For example:
The porcelain format is intended for text processing, so it is not very readable by eye. However, it is well documented in man 1 git-diff for your programming pleasure.
The downside is that you will have to take the author information from each commit's author or committer field (the values that GIT_AUTHOR_NAME and GIT_COMMITTER_NAME supply at commit time), rather than having Git decorate each line for you.
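As a sketch of that bookkeeping, the script below asks git log for word-level patches and tallies added and removed word runs per author. I am assuming the porcelain format meant here is --word-diff=porcelain, whose hunk lines start with a space, "+", "-" or "~". The NUL-prefixed marker line produced by the --format string and the helper names are my own choices, not anything Git prescribes.

```python
import subprocess
from collections import defaultdict


def tally_word_changes(log_text):
    """Count added/removed word runs per author in the output of
    `git log --patch --word-diff=porcelain --format=%x00%H%x09%an`."""
    totals = defaultdict(lambda: {"added": 0, "removed": 0})
    author = None
    in_hunk = False
    for line in log_text.split("\n"):
        if line.startswith("\x00"):
            # Marker line from --format: NUL, commit hash, TAB, author name.
            _sha, _, author = line[1:].partition("\t")
            in_hunk = False
        elif line.startswith("diff --git "):
            # File header; the '---' and '+++' lines after it are not changes.
            in_hunk = False
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and author is not None:
            if line.startswith("+"):
                totals[author]["added"] += 1
            elif line.startswith("-"):
                totals[author]["removed"] += 1
    return dict(totals)


def word_changes_by_author(path, repo="."):
    """Run git log for one file and tally word-level changes per author."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--patch", "--word-diff=porcelain",
         "--format=%x00%H%x09%an", "--", path],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    return tally_word_changes(out)
```

This only counts changes; mapping them back onto the lines of the current file, which is what git blame does for you, would still be up to your own code.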