Git Commit Generation Numbers

Posted 2019-02-05 17:12

What are Git commit generation numbers (Hacker News link), and what is their significance?

Tags: git commit
2 answers
手持菜刀,她持情操
Reply #2 · 2019-02-05 17:33

Just to add to siri's answer, "Commit Generation Numbers" are:

A commit's generation is its height in the history graph, as measured from the farthest root. It is defined as:

  • If the commit has no parents, then its generation is 0.
  • Otherwise, its generation is 1 more than the maximum of its parents' generations.
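The two rules above can be sketched in a few lines of Python. This is only an illustration: the `parents` dictionary and the commit names are made up for the example and stand in for Git's real object store, and the function is not how Git itself computes or stores these numbers.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Return the generation of every commit in `parents`.

    `parents` maps a commit id to the ids of its parents
    (a hypothetical in-memory stand-in for a real repository).
    """
    gen: Dict[str, int] = {}
    for start in parents:
        # Iterative post-order walk, so deep histories do not hit
        # Python's recursion limit.
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Parents must be numbered before the commit itself.
                stack.extend(pending)
            else:
                # No parents -> 0; otherwise 1 + max of the parents.
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=-1)
                stack.pop()
    return gen


history = {
    "A": [],          # root commit
    "B": ["A"],
    "C": ["B"],
    "D": ["A"],       # a short side branch
    "M": ["C", "D"],  # merge commit
}
gens = generation_numbers(history)
print(gens)
```

Note that the merge commit `M` takes its number from the longer of its two parent chains, which is what "measured from the farthest root" means.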
  • an old topic already mentioned at the creation of Git in 2005:

Linus Torvalds (yesterday, July 14th):
Ok, so I see that the old discussion about generation numbers has resurfaced.
And I have to say, with six years of git use, I think it's not a coincidence that the notion of generation numbers has come up several times over the years: I think the lack of them is literally the only real design mistake we have.
[...]
It actually came up as early as July 2005, so the "let's use generation numbers in commits" thing is really old.

  • about the question of quickly knowing if a commit is an ancestor of another commit (without having to walk back the DAG -- the graph of commits --):

I think it's entirely reasonable to say that the issue basically boils down to one git question: "can commit X be an ancestor of commit Y" (as a way to basically limit certain algorithms from having to walk all the way down). We've used commit dates for it, and realistically it really has worked very well. But it was always a broken heuristic.

So yes, I personally see generation counters as a way to do the commit date comparisons right. And it would be perfectly fine to just say "if there are no generation numbers, we'll use the datestamps instead, and know that they could be incorrect".

That "use the datestamps" fallback thing may well involve all the heuristics we already do (ie check for the stamps looking sane, and not trusting just one individual one).

As the Hacker News thread mentions:

Generation numbers are a result of the state of the tree, while timestamps are derived from the ambient (and potentially incorrect!) environment from which the commit was made.

At the moment, each commit stores a reference to the parent tree.
By parsing that tree and reading the entire history you can obtain a hierarchy of commits.
Because you need to order commits in many situations, reading the entire history is extremely inefficient, so git uses timestamps to determine the ordering of commits.
This of course fails if the system clock on a given machine is off.
With a generation number, you can get an ordering locally from the latest commits, without having to rely on timestamps or read the entire tree.

When you have a commit with generation n, any later commits that include it would have generation >n, so to tell the relation between commits, you only need look as far back as n, and you can immediately get the order of any intermediate commits.
It has nothing to do with "easy to remember". It's about making git more efficient and robust.
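To make the "only look as far back as n" point concrete, here is a minimal sketch of an ancestry test that uses generation numbers to cut the walk short. The function name `is_ancestor`, the toy `parents` and `gen` tables, and the convention that a commit counts as its own ancestor are all assumptions of this example, not Git's actual implementation.

```python
from typing import Dict, List


def is_ancestor(x: str, y: str,
                parents: Dict[str, List[str]],
                gen: Dict[str, int]) -> bool:
    """Return True if commit `x` is reachable from commit `y` via parent links."""
    if x == y:
        return True
    seen = set()
    stack = [y]
    while stack:
        commit = stack.pop()
        if commit == x:
            return True
        # An ancestor always has a strictly smaller generation than its
        # descendants, so x cannot lie below a commit whose generation
        # is <= gen[x]. Stop walking this path.
        if commit in seen or gen[commit] <= gen[x]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Toy history: A is the root, M merges the branches ending in C and D.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "M": ["C", "D"]}
gen = {"A": 0, "B": 1, "C": 2, "D": 1, "M": 3}

print(is_ancestor("B", "M", parents, gen))  # B is on the path from M to the root
print(is_ancestor("D", "C", parents, gen))  # D and C are on different branches
print(is_ancestor("M", "A", parents, gen))  # rejected without walking anything
```

The last call is the interesting one: because `gen["A"]` is smaller than `gen["M"]`, the answer is known without following a single parent pointer, and no timestamps are involved.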

  • not redundant:

Generation numbers are completely redundant with the actual structure of history represented by the parent pointers.

Linus:

Not true. That's only true if you add "... if you parse the whole history" to that statement.
And we've never parsed the whole history, because it's just too expensive and doesn't scale. So right now we depend on commit dates with a few hacks.
So no, generation numbers are not at all redundant. They are fundamental. It's why we had this discussion six years ago.


There is still a debate about where to cache that information (or whether it should be cached at all). From the user's point of view, though, the discussion keeps drifting toward some "easy to remember" identifier, which is not the goal of commit generation numbers:

So it's almost, but not quite, like the revision numbers everyone else has always had?

Yes -- almost, but not quite.
If you and I each create a branch off of a commit with gen #123, then, as I understand it, the subsequent commits in my branch would be #124, #125, etc., and your commits in your branch would also be #124, #125, etc.

Contrast this:

  • with CVS, where I would have 1.124.1.1, 1.124.1.2, etc., and you would have 1.124.2.1, 1.124.2.2, or
  • with Subversion, where I might get revisions 125, 128, and 129, while the server gave your commits #124, 127 and 130, and someone else, on a totally different part of the project got #126.

As long as development proceeds linearly, on a single branch, then yeah, it's about the same as revision numbers in a centralized RCS -- once you start branching and merging, though, it represents a different concept entirely.
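The branching scenario described above (two people branching off a commit with generation 123) can be checked with a small sketch. The `commit` helper and the commit names below are hypothetical and exist only for this illustration; they simply apply the "1 + max of the parents" rule at creation time.

```python
from typing import Dict, List

generation: Dict[str, int] = {}


def commit(name: str, parents: List[str]) -> int:
    """Record a new commit and return its generation number."""
    generation[name] = 1 + max((generation[p] for p in parents), default=-1)
    return generation[name]


# Shared linear history: base0 .. base123, so base123 has generation 123.
commit("base0", [])
for i in range(1, 124):
    commit(f"base{i}", [f"base{i - 1}"])

# Two developers branch off the same commit independently.
mine = [commit("mine1", ["base123"]), commit("mine2", ["mine1"])]
yours = [commit("yours1", ["base123"]),
         commit("yours2", ["yours1"]),
         commit("yours3", ["yours2"])]

# A merge takes the larger parent generation plus one.
merged = commit("merge", ["mine2", "yours3"])

print(mine, yours, merged)
```

Both branches reuse the numbers 124 and 125, so a generation number on its own does not identify a commit the way a Subversion revision number does; it only bounds where a commit can sit in the graph.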

For a single repository, it does have a very similar interpretation to, say, svn revnos.
You can speak of "revision #125 of a branch" in a specific repository. Which is generally exactly what you need for human-to-human communication about development.
"Can you see if that bug is in r125 of unstable?" "I've got all changes up to r245 of prod"
I guess the confusing aspect would be if "r245 of prod" in the central server was "r100 of prod" in my local repo because I haven't cloned the full history?

贪生不怕死
Reply #3 · 2019-02-05 17:39

The problem (as implied in the thread on git@vger.kernel.org) is that the DAG direction that we trust is counted in the reverse direction, from branch head back through parentage. The generation numbers (even if recorded at commit time) are counted through descendants. Plus we mess with the perceived history often in our different (distributed) repos - hence all the issues.


Just read Linus's latest. Apart from his misreading about renames (I think George Spelvin was agreeing with him: do not record renames within the repo, simply take snapshots), he does point out that:

the very basic design of git is all about incomplete DAG traversal. The DAG traversal part is pretty obvious and simple, but the partial thing really is very very important.

Thus essentially a pre-recorded commit "generation" number would tell you how far (the maximum) you still have to go to the bottom (root), so if you can trust it, then you can make the choice about stopping an incomplete DAG traversal. Without it you would have to go the whole way to the root, which is inefficient.

So I think I've changed my mind now that I realise it is a stopping criterion. That's not to say that some (locally calculated) cache might not speed up some searches.
