How to find all unmerged commits in master grouped

Published 2019-01-23 13:27

三岁会撩人

Question:

I have to create some code review from unmerged branches.

In finding solutions, let's not go to local-branch context problem as this will run on a server; there will be just the origin remote, I will always run a git fetch origin command before other commands, and when we talk about branches, we will refer to origin/branch-name.

If the setup were simple and each branch that originated from master continued on its own way, we could just run:

git rev-list origin/branch-name --not origin/master --no-merges

for each unmerged branch and add the resulting commits to each review per branch.
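
For a server-side script, the per-branch call could be wrapped roughly like this. It is only a sketch: the function names and the injectable `run` parameter are made up for illustration, and it assumes git is on the PATH and the current directory is the fetched repository.

```python
import subprocess


def rev_list_command(branch, base="origin/master"):
    """Build the rev-list command from the question for one branch."""
    return ["git", "rev-list", branch, "--not", base, "--no-merges"]


def unmerged_commits(branch, base="origin/master", run=subprocess.check_output):
    """Return the hashes of non-merge commits on `branch` that are not on `base`.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without a real repository.
    """
    output = run(rev_list_command(branch, base), text=True)
    return output.split()
```

Calling `unmerged_commits("origin/branch1")` in a real clone would return one hash per commit, newest first, exactly as `git rev-list` prints them.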

The problem arises when there are merges between two or three branches and work continues on some of them. As I said, I want to create the code reviews for each branch programmatically, and I don't want to include a commit in more than one review.

The problem mainly reduces to finding the original branch of each commit.
Or, to put it more simply: finding all unmerged commits grouped by the branch they were most probably created on.

Let's focus on a simple example:

      *    b4 - branch2's head
   *  |    a4 - branch1's head
   |  *    b3
   *  |    merge branch2 into branch1
*  |\ |    m3 - master's head
|  * \|    a3
|  |  |
|  |  *    b2
|  *  |    merge master into branch1
* /|  |    m2
|/ |  *    merge branch1 into branch2
|  * /|    a2
|  |/ |
|  |  *    b1
|  | /
|  |/
| /|
|/ |
|  *       a1
* /        m1
|/
|
*          start

and what I want to obtain is:

branch1: a1, a2, a3, a4
branch2: b1, b2, b3, b4

The best solution I found so far is to run:

git show-branch --topo-order --topics origin/master origin/branch1 origin/branch2

and parse the result:

* [master] m3
 ! [branch1] a4
  ! [branch2] b4
---
  + [branch2] b4
  + [branch2^] b3
 +  [branch1] a4
 ++ [branch2~2] b2
 -- [branch2~3] Merge branch 'branch1' into branch2
 ++ [branch2~4] b1
 +  [branch1~2] a3
 +  [branch1~4] a2
 ++ [branch1~5] a1
*++ [branch2~5] m1

The output is interpreted like this:

1. The first n lines are the n branches analyzed.
2. One line with ---.
3. One line for each commit, with a plus (or a minus for merge commits) at the n-th indentation character if that commit is on the n-th branch.
4. The last line is the merge base of all branches analyzed.

For point 3, the commit name starts with a branch name and, from what I see, this branch corresponds to the branch the commit was created on, probably because paths that reach the commit through first parents are preferred.

As I'm not interested in merge commits, I'll ignore them.

I'll then parse each branch-path-commit to obtain their hash with rev-parse.
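
The parsing described above might look something like the following sketch. It assumes the output has exactly the layout shown in the question (one header line per branch, a dashed separator, then one row per commit) and that the first branch passed to show-branch is master. The function name and the `base_column` parameter are hypothetical.

```python
import re
from collections import defaultdict


def group_show_branch(output, base_column=0):
    """Group `git show-branch --topics` rows by the branch in their [name].

    Returns {branch: [(rev_expression, subject), ...]}, skipping merge
    commits and any row that is also on the base branch (the merge base).
    """
    lines = output.splitlines()
    # The separator is the first line made only of dashes.
    sep = next(i for i, line in enumerate(lines) if line and set(line) == {"-"})
    # One header line per branch precedes the separator.
    n = len([line for line in lines[:sep] if line.strip()])
    row = re.compile(r"^(.{%d}) \[([^\]]+)\] (.*)$" % n)

    groups = defaultdict(list)
    for line in lines[sep + 1:]:
        match = row.match(line)
        if match is None:
            continue
        flags, rev, subject = match.groups()
        if "-" in flags:               # a minus marks a merge commit
            continue
        if flags[base_column] != " ":  # also on master, e.g. the merge base
            continue
        branch = re.split(r"[~^]", rev, maxsplit=1)[0]
        groups[branch].append((rev, subject))
    return dict(groups)
```

Fed the sample output from the question, this yields the revisions b4, b3, b2, b1 under branch2 and a4, a3, a2, a1 under branch1; each revision expression (such as `branch2~4`) can then be passed to rev-parse.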

How can I handle this situation?

Answer 1:

The repository could be cloned with --mirror, which creates a bare repository that mirrors the original and can be updated with git remote update --prune. After each update, all tags should be deleted for this approach to work.

I implemented it this way:
1. get a list of branches not merged into master

git branch --no-merged master

2. for each branch get a list of revisions on that branch and not in master branch

git rev-list branch1 --not master --no-merges

If the list is empty, remove the branch from the list of branches
3. for each revision, determine the original branch with

git name-rev --name-only revisionHash1

and match the result against the regex ^([^\~\^]*)([\~\^].*)?$. The first group is the branch name; the second is the path relative to the branch head.
If the branch name found is not equal to the initial branch, remove the revision from the list.

At the end I obtained a list of branches and for each of them a list of commits.
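
Step 3 could be sketched in Python as follows, assuming the name-rev names have already been collected for each revision. The function names and the input shape are made up for illustration; the regex is the one from the answer without the unnecessary escapes.

```python
import re

# Same pattern as in the answer: branch name, then an optional ~N / ^N path.
NAME_REV = re.compile(r"^([^~^]*)([~^].*)?$")


def split_name_rev(name):
    """Split a name such as 'branch2~4' into ('branch2', '~4')."""
    match = NAME_REV.match(name)
    return match.group(1), match.group(2) or ""


def keep_own_commits(candidates):
    """Keep only revisions whose name-rev branch equals the branch listing them.

    `candidates` maps branch -> [(sha, name_rev_name), ...]; branches left
    with no revisions are dropped, as in step 2.
    """
    kept = {}
    for branch, revisions in candidates.items():
        own = [sha for sha, name in revisions if split_name_rev(name)[0] == branch]
        if own:
            kept[branch] = own
    return kept
```

For example, a revision listed under branch1 whose name-rev name is `branch2~4` is dropped from branch1, because it resolves relative to branch2.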

After some more bash research, it can be done all in one line with:

git rev-list --all --not master --no-merges | xargs -L1 git name-rev | grep -oE '[0-9a-f]{40}\s[^\~\^]*'

The result is an output in the form

hash branch

which can be read, parsed, ordered, grouped or whatever.
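
As a rough illustration of the grouping, the `hash branch` lines could be collected like this (the function name is hypothetical, and the input is assumed to be the one-liner's output split into lines):

```python
from collections import defaultdict


def group_by_branch(lines):
    """Turn 'hash branch' lines into {branch: [hash, ...]}, keeping input order."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        sha, branch = parts
        groups[branch.strip()].append(sha)
    return dict(groups)
```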

Answer 2:

If I grasp your problem space, I think you can use --sha1-name

git show-branch --topo-order --topics --sha1-name origin/master origin/branch1 origin/branch2

to list what you are interested in, then run the commits through git-what-branch

git-what-branch: Discover what branch a commit is on, or how it got to a named branch. This is a Perl script from Seth Robertson

and format the report to suit your needs?

Answer 3:

There is no correct answer to this question because it is underspecified.

Git history is simply a directed acyclic graph (DAG), and it's generally impossible to determine semantic relationships between two arbitrary nodes in a DAG unless the nodes are sufficiently labeled. Unless you can guarantee that the commit messages in your example graph follow a reliable, machine-parseable pattern, the commits are not sufficiently labeled—it's impossible to automatically identify the commits you are interested in without additional context (e.g., guarantees that your developers follow certain best practices).

Here's an example of what I mean. You say that commit a1 is associated with branch1, but this can't be determined with certainty just by looking at the nodes of your example graph. It's possible that once upon a time your example repository history looked like this:

      *    merge branch1 into branch2 - branch2's head
      |\
     _|/
    / *    b1
   |  |
   |  |
  _|_/
 / |
|  *       a1
* /        m1
|/
|
*          start - master's head

Note that branch1 doesn't even exist yet in the above graph. The above graph could have arisen from the following sequence of events:

1. branch2 is created at start in the shared repository
2. user#1 creates a1 on his/her local branch2 branch
3. meanwhile, user#2 creates m1 and b1 on his/her local branch2 branch
4. user#1 pushes his/her local branch2 branch to the shared repository, causing the branch2 ref in the shared repository to point to a1
5. user#2 tries to push his/her local branch2 branch to the shared repository, but this fails with a non-fast-forward error (branch2 currently points to a1 and can't be fast-forwarded to b1)
6. user#2 runs git pull, merging a1 into b1
7. user#2 runs git commit --amend -m "merge branch1 into branch2" for some inexplicable reason
8. user#2 pushes, and the shared repository history ends up looking like the above DAG

Some time later, user#1 creates branch1 off of a1 and creates a2, while user#2 fast-forward merges m1 into master, resulting in the following commit history:

      *    merge a1 into b1 - branch2's head
   *  |\   a2 - branch1's head
   | _|/
   |/ *    b1
   |  |
   |  |
  _|_/
 / |
|  *       a1
* /        m1 - master's head
|/
|
*          start

Given that this sequence of events is technically possible (although unlikely), how can a human, let alone Git, tell you which commits "belong" to which branch?

Parsing Merge Commit Messages

If you can guarantee that users don't change merge commit messages (they always accept the Git default), and that Git has never and will never change the default merge commit message format, then the merge commit's commit message can be used as a clue that a1 started off on branch1. You'll have to write a script to parse the commit messages—there are no simple Git one-liners to do this for you.
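
Such a parser might start from something like this sketch. It assumes the default subject lines look like "Merge branch 'branch1' into branch2" (as in the question's show-branch output), optionally with "remote-tracking" or an "of <url>" part; the function name is made up, and any edited message simply fails to match.

```python
import re

# Assumed shape of Git's default merge subject; not guaranteed across versions.
MERGE_SUBJECT = re.compile(
    r"^Merge (?:remote-tracking )?branch '(?P<source>[^']+)'"
    r"(?: of \S+)?(?: into (?P<target>\S+))?$"
)


def parse_merge_subject(message):
    """Return (source, target) from a default merge message, or None.

    `target` is None when the subject has no "into ..." part.
    """
    lines = message.splitlines()
    if not lines:
        return None
    match = MERGE_SUBJECT.match(lines[0])
    if match is None:
        return None
    return match.group("source"), match.group("target")
```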

If Merges are Always Intentional

Alternatively, if your developers follow best practices (each merge is intentional and is meant to bring in a differently-named branch, resulting in a repository without those stupid merge commits created by git pull), and you are not interested in the commits from a completed child branch, then the commits you're interested in are on the first-parent path. If you know which branch is the parent of the branch you are analyzing, you can do the following:

git rev-list --first-parent --no-merges parent-branch-ref..branch-ref

This command lists the SHA1 identifiers for the commits that are reachable from branch-ref excluding the commits reachable from parent-branch-ref and the commits that were merged in from child branches.

In your example graph above, assuming parent order is determined by your annotations and not by the order of the lines going into a merge commit, git rev-list --first-parent --no-merges master..branch1 would print the SHA1 identifiers for commits a4, a3, a2, and a1 (in that order; use --reverse if you want the opposite order), and git rev-list --first-parent --no-merges master..branch2 would print the SHA1 identifiers for commits b4, b3, b2, and b1 (again, in that order).
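
To make the first-parent walk concrete, here is a toy model in Python of what the command is expected to select. It does not call Git: the `PARENTS` table is my reading of the question's example graph, with first parents taken from the "merge X into Y" annotations (the same assumption as above), and the function names are hypothetical.

```python
# Toy model of the example graph: commit -> parents, first parent first.
# Assumption: parent order follows the "merge X into Y" annotations.
PARENTS = {
    "start": [],
    "m1": ["start"], "m2": ["m1"], "m3": ["m2"],
    "a1": ["start"], "a2": ["a1"],
    "b1": ["m1"],
    "merge branch1 into branch2": ["b1", "a2"],
    "merge master into branch1": ["a2", "m2"],
    "a3": ["merge master into branch1"],
    "b2": ["merge branch1 into branch2"],
    "merge branch2 into branch1": ["a3", "b2"],
    "a4": ["merge branch2 into branch1"],
    "b3": ["b2"], "b4": ["b3"],
}


def ancestors(commit, parents):
    """Return the commit plus everything reachable through any parent."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def first_parent_no_merges(parent_head, branch_head, parents):
    """Follow first parents from branch_head until history shared with parent_head."""
    excluded = ancestors(parent_head, parents)
    result, current = [], branch_head
    while current is not None and current not in excluded:
        if len(parents[current]) < 2:  # skip merge commits
            result.append(current)
        current = parents[current][0] if parents[current] else None
    return result
```

With master at m3, `first_parent_no_merges("m3", "a4", PARENTS)` returns `['a4', 'a3', 'a2', 'a1']` and the same call for `"b4"` returns `['b4', 'b3', 'b2', 'b1']`.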

If Branches Have Clear Parent/Child Relationships

If your developers do not follow best practices and your branches are littered with those stupid merges created by git pull (or an equivalent operation), but you have clear parent/child branch relationships, then writing a script to perform the following algorithm may work for you:

1. Find all commits reachable from the branch of interest excluding all commits from its parent branch, its parent's parent branch, its parent's parent's branch, etc., and save the results. For example:
```
git rev-list master..branch1 >commit-list
```
2. Do the same for all child, grandchild, etc. branches of the branch of interest. For example, assuming branch2 is considered to be a child of branch1:
```
git rev-list ^master ^branch1 branch2 >commits-to-filter-out
```
3. Filter out the results of step #2 from the results of step #1. For example:
```
grep -Fv -f commits-to-filter-out commit-list
```

The trouble with this approach is that once a child branch is merged into its parent, those commits are considered to be part of the parent even if development on the child branch continues. Although this makes sense semantically, it does not produce the result you say you want.
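
A small Python model of the three steps shows that effect. The graph in `TOY` is hypothetical (not the question's graph): a child branch is created at a1, contributes c1, is merged into its parent as M, and then continues with c2. The function names are made up, and the sets stand in for the rev-list output files.

```python
# Hypothetical toy graph: commit -> parents, first parent first.
TOY = {
    "start": [], "m1": ["start"],   # master head: m1
    "a1": ["m1"], "a2": ["a1"],
    "c1": ["a1"],                   # created on the child branch
    "M": ["a2", "c1"],              # parent branch head: merge of the child
    "c2": ["c1"],                   # child branch head: work continued after the merge
}


def reachable(head, parents):
    """Return the head plus everything reachable through any parent."""
    seen, stack = set(), [head]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def rev_list(include, excludes, parents):
    """Model of `git rev-list ^exclude... include` on the toy graph."""
    hidden = set()
    for head in excludes:
        hidden |= reachable(head, parents)
    return reachable(include, parents) - hidden


def branch_commits(branch, parent_heads, child_heads, parents):
    commit_list = rev_list(branch, parent_heads, parents)                   # step 1
    to_filter_out = set()
    for child in child_heads:                                               # step 2
        to_filter_out |= rev_list(child, parent_heads + [branch], parents)
    return commit_list - to_filter_out                                      # step 3
```

Here `branch_commits("M", ["m1"], ["c2"], TOY)` gives `{"M", "a2", "a1", "c1"}`: only c2 lands in the filter list, so c1 stays attributed to the parent branch even though it was created on the child.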

Some Best Practices

Here are some best practices to make this particular problem easier to solve in the future. Most if not all of these can be enforced via clever use of hooks in the shared repository.

1. Only one task per branch. Multiple tasks are prohibited.
2. NEVER permit development to continue on a child branch once it has been merged to its parent. Merging implies that a task is done, end of story. Answers to anticipated questions:
   - Q: What if I discover a bug in the child branch? A: Start a new branch off of the parent. Do NOT continue development on the child branch.
   - Q: What if the new feature isn't done yet? A: Then why did you merge the branch? Perhaps you merged a complete subtask; if so, the remaining subtasks should go on their own branches off of the parent branch. Do NOT continue development on the child branch.
3. Forbid the use of git pull.
4. A child branch must not be merged into its parent unless all of its children branches have been merged into it.
5. If the branch does not have any children branches, consider rebasing it onto the parent branch before merging with --no-ff. If it does have children branches, you can still rebase, but please preserve the --no-ff merges of the children branches (this is trickier than it should be).
6. Merge the parent branch into the child branch frequently to make merge conflicts easier to resolve.
7. Avoid merging a grandparent branch directly into its grandchild branch; merge into the child first, then merge the child into the grandchild.

If all of your developers follow these rules, then a simple:

git rev-list --first-parent --no-merges parent-branch..child-branch

is all you need to see the commits that were made on that branch minus the commits made on its children branches.

Answer 4:

I would suggest doing it kind of the way you described it. But I would work on the output of git log --format="%H:%P:%s" ^origin/master origin/branch1 origin/branch2, so you can do better tree-walking.

1. Build a proper tree structure from the output, marking parents and children.
2. Start walking from the heads (get their SHAs from git rev-parse). Mark every commit with the name of the head you came from and its distance.
   - For not-first-parent steps (the other part of the merge), I would add 100 to the distance.
   - If you meet a merge commit, check what it says about which branch was merged into which. Use this information when following the two parent links: if the parsed name of the branch you are going to does not match your current HEAD, add 10000 to the distance.
   - For both parents, you now know their branch name. Add all the children for which they are the first parent to a dict: commit -> known-name.
3. Take your dict of known-named commits and start walking up the tree (towards the children, not the parents). Subtract 10000 from the distance from the merged-into branch. While doing this walk, do not go to commits that you are not the first parent of, and stop as soon as you hit a branch point (a commit that has two children). Also stop if you hit one of your branch heads.

Now for each of your commits, you will have a list of distance values (that might be negative) to your branch heads. For each commit, the branch with the least distance is the one the commit was most likely created on.
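
A stripped-down sketch of the distance idea, in Python, might look like this. It is not the full procedure: it keeps only the 100-point penalty for non-first-parent steps, charges 1 for every step (my assumption, since no per-step cost is given), and leaves out the merge-message parsing, the 10000 adjustment and the upward walk. It expects the `%H:%P:%s` log format mentioned above; the function names are hypothetical.

```python
import heapq


def parse_log(text):
    """Parse `git log --format=%H:%P:%s` output into {sha: [parent shas]}."""
    parents = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, parent_field, _subject = line.split(":", 2)
        parents[sha] = parent_field.split()
    return parents


def distances_from(head, parents, side_penalty=100):
    """Cheapest distance from `head` to each listed commit (Dijkstra).

    Each step costs 1 (assumption); non-first-parent steps cost 100 more.
    """
    dist = {head: 0}
    queue = [(0, head)]
    while queue:
        d, sha = heapq.heappop(queue)
        if d > dist[sha]:
            continue
        for index, parent in enumerate(parents.get(sha, [])):
            if parent not in parents:  # excluded history, e.g. master
                continue
            cost = d + 1 + (side_penalty if index > 0 else 0)
            if cost < dist.get(parent, float("inf")):
                dist[parent] = cost
                heapq.heappush(queue, (cost, parent))
    return dist


def likely_branch(parents, heads):
    """Map each commit to the head name with the least distance.

    `heads` maps branch name -> head sha.
    """
    per_branch = {name: distances_from(sha, parents) for name, sha in heads.items()}
    result = {}
    for sha in parents:
        candidates = [(dist[sha], name) for name, dist in per_branch.items() if sha in dist]
        if candidates:
            result[sha] = min(candidates)[1]
    return result
```

On a hand-written log of the question's example graph, even this reduced version assigns a1 through a4 to branch1 and b1 through b4 to branch2, because reaching the other branch's commits always requires a penalized second-parent step.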

If you have time, you might want to walk the whole history and then subtract the history of master; that might give slightly better results if your branches have been merged into master before.

I couldn't resist and made a Python script that does what I described, with one change: with every normal step, the distance is not increased but decreased. This has the effect that branches that lived longer after a merge point are preferred, which I personally like more. Here it is: https://gist.github.com/Chronial/5275577

Usage: simply run git-annotate-log.py ^origin/master origin/branch1 origin/branch2 and check the quality of the results (it outputs a git log tree with annotations).