Preserve history of file from before it was moved

Posted 2019-07-20 07:14

Suppose I have a git repo containing a file text.txt inside a directory dir_a. Later on, I decide to move text.txt to a new directory called dir_b.

After a while, I decide I should split dir_b into its own standalone git repository using git subtree split. By default, the earliest commit in dir_b's repository is the commit where I moved text.txt from dir_a to dir_b, which is unfortunate because, for example, git blame won't work as intended.

Is there a way to preserve, in the new git repo, the changes made to text.txt when it was still in dir_a?

To be clear: in the original repository, the commit where I move text.txt from dir_a to dir_b is correctly registered as a rename, so e.g. git diff works properly there. My problem is that the commits made before the move aren't carried over to the new repository.

1 answer
戒情不戒烟
Reply #2 · 2019-07-20 07:45

Edit: I quite missed the git subtree split -P prefix part of this. The original answer still applies, but with a possibly-fatal twist.

When you run git subtree split -P prefix [ options ] [ commit-range ], you are telling Git to copy some commits to new ones. You have Git copy whatever commits contain any files within the given prefix, but with these changes:

  1. Discard all files that don't begin with the given prefix.
  2. Rename the remaining files to strip prefix (and a slash) off.
  3. If the commit matches the previously-copied commit, drop it entirely.

(You could do this with git filter-branch as well, although it would be slower than git subtree split, and it requires that you first create a new branch to filter.)
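The three copy rules can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not how git subtree is implemented: the history is linear, each snapshot is a plain dict from path to content, and the commit names c1 to c5 and the function split_subtree are made up for the example. It uses the dir_a / dir_b layout from the question.

```python
def split_subtree(history, prefix):
    """Toy model of the three copy rules on a linear history.

    history: list of (commit_id, snapshot), oldest first, where each
    snapshot maps path -> content. Returns the copied commits, oldest first.
    """
    prefix = prefix.rstrip("/") + "/"
    copied = []
    for commit_id, snapshot in history:
        # Rules 1 and 2: keep only files under the prefix, and strip it.
        subtree = {
            path[len(prefix):]: content
            for path, content in snapshot.items()
            if path.startswith(prefix)
        }
        if not subtree:
            continue  # this commit has nothing under the prefix
        # Rule 3: drop a copy identical to the previously copied commit.
        if copied and copied[-1][1] == subtree:
            continue
        copied.append((commit_id + "'", subtree))
    return copied


# Hypothetical history: c3 is the commit that moves the file.
history = [
    ("c1", {"dir_a/text.txt": "v1"}),
    ("c2", {"dir_a/text.txt": "v2"}),
    ("c3", {"dir_b/text.txt": "v2"}),
    ("c4", {"dir_b/text.txt": "v2", "README": "hi"}),  # unrelated change
    ("c5", {"dir_b/text.txt": "v3", "README": "hi"}),
]
result = split_subtree(history, "dir_b")
print([commit_id for commit_id, _ in result])
```

Only the copies of c3 and c5 survive. c1 and c2 have no files under dir_b, so the edits made while the file lived in dir_a never reach the new history, which is exactly the problem described in the question.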

The result is a new, disjoint commit graph (or subgraph since it's now added to your main commit graph), rooted at the first-copied commit and terminating at the last-copied one. (The copy process has to enumerate commits, in Git's usual backwards fashion, from a single tip commit, not from multiple tip commits. Once all commits have been found this way, the copying goes from root / last-enumerated, to tip, as it must.) You can then give this new sub-graph a branch name using git subtree's -b branch option. If you don't give it a name, you have a short period (14 days by default) during which you can do something with the tip commit hash ID that git subtree split prints, and after that the copies are eligible for automatic garbage-collection.

As a brief illustration, consider the following graph:

     C--D--E
    /       \
A--B         H--I--J--K   <-- master
    \       /
     F-----G

Let's say commit A contains a README (and nothing else), B adds the first part of the project, and C-D-E add more of it. F and G come from a feature branch and add a subtree named subbie containing various files, and H merges that subtree. In I the subtree is renamed to feature, in J nothing happens to it, and in K feature/README_TOO is added.

If you now split feature as a subtree, this makes Git copy commits:

  • I: feature first appears as a name, containing, e.g., feature/__init__.py and feature/impl.py.
  • K: feature/README_TOO appears.

As a new, independent sub-graph of commits, it looks like this:

     C--D--E
    /       \
A--B         H--I--J--K   <-- master
    \       /
     F-----G

I'--K'                    <-- dash-b-argument

Note that we did not copy F, G, and H: they do not have files whose name starts with feature/. Commit J does have such files, but they are the same as they were in commit I, so we skipped it. Meanwhile, the names of the files in commits I' and K' are not feature/__init__.py and so on, but rather simply __init__.py and so on.
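The walk over this particular graph can be modelled in a few lines of Python. This is a hedged sketch, not git subtree's actual code: the file contents are invented, and a commit with no files under the prefix is simply treated as "not copied", which glosses over cases the real tool handles.

```python
# Parents of each commit in the graph above (H is the merge).
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"], "H": ["E", "G"],
    "I": ["H"], "J": ["I"], "K": ["J"],
}

# Files under feature/ in each commit, with the prefix already stripped.
# Contents are made up; commits not listed have no feature/ files.
FEATURE_TREE = {
    "I": {"__init__.py": "", "impl.py": "v1"},
    "J": {"__init__.py": "", "impl.py": "v1"},
    "K": {"__init__.py": "", "impl.py": "v1", "README_TOO": "docs"},
}


def enumerate_from_tip(tip):
    """Walk backwards from the tip; return commits with parents first."""
    order, seen = [], set()

    def visit(commit):
        if commit in seen:
            return
        seen.add(commit)
        for parent in PARENTS[commit]:
            visit(parent)
        order.append(commit)  # appended only after all of its ancestors

    visit(tip)
    return order


def split(tip):
    """Return {copied_commit: [copied_parents]} for the new sub-graph."""
    copy_of = {}    # original commit -> name of its copy, or None
    tree_of = {}    # copy name -> stripped tree
    new_graph = {}
    for commit in enumerate_from_tip(tip):
        tree = FEATURE_TREE.get(commit)
        if tree is None:
            copy_of[commit] = None  # no files under the prefix
            continue
        # Copies of the parents, ignoring parents that were not copied.
        parents = []
        for parent in PARENTS[commit]:
            copy = copy_of[parent]
            if copy is not None and copy not in parents:
                parents.append(copy)
        identical = [p for p in parents if tree_of[p] == tree]
        if identical:
            copy_of[commit] = identical[0]  # same tree as a copied parent
            continue
        name = commit + "'"
        copy_of[commit] = name
        tree_of[name] = tree
        new_graph[name] = parents
    return new_graph


print(split("K"))
```

In this model the result contains just I' (with no parents) and K' (whose parent is I'), matching the picture: J collapses into I' because its stripped tree is identical.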

As I noted in the original answer, the history in a repository is the commits. We view the history by starting from a branch tip commit and working backwards. If we start at K' and work backwards to I', the history is just those two commits. To discover the rename, we would have to also copy commits F and G at least, and maybe H as well (there's nothing for H to merge this time as we would skip A-B-C-D-E, so we'd probably just drop H entirely). But to do that, we would have to know to preserve subbie/*.

You could modify the git subtree code to allow additional preserved-as-is prefix arguments. There is no clear way to reverse this after the fact, though. The basic git subtree code relies on a unique prefix: it was always stripped off, so to reverse the transformation, we always add it back. The two obvious options are: never strip any prefix (so never add anything), or require that additional, non-stripped prefixes never "collide with" prefix-stripped names. That is, given any arbitrary copied commit, if its snapshot has a file named pa/th/to/file.ext, either pa/th/to is not a "preserved in place" prefix (so it gets the -P prefix added back), or else pa/th/to is such a prefix (so it gets nothing added).
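To make the reversal rule concrete, here is a minimal Python sketch. Nothing like it exists in git subtree today; the function names restore_path and collides, and the idea of passing a list of preserved prefixes, are assumptions made purely for illustration.

```python
def restore_path(path, prefix, preserved=()):
    """Map a path in a split commit back to its original location.

    prefix is the stripped -P prefix; preserved lists the hypothetical
    "preserved in place" prefixes, which were never stripped.
    """
    for kept in preserved:
        kept = kept.rstrip("/")
        if path == kept or path.startswith(kept + "/"):
            return path                     # kept as-is: add nothing back
    return prefix.rstrip("/") + "/" + path  # stripped: add the prefix back


def collides(original_path, prefix, preserved):
    """True if stripping prefix yields a name under a preserved prefix."""
    full = prefix.rstrip("/") + "/"
    if not original_path.startswith(full):
        return False
    stripped = original_path[len(full):]
    # After stripping, the name would be mistaken for a preserved path.
    return restore_path(stripped, prefix, preserved) == stripped


print(restore_path("text.txt", "dir_b", preserved=["dir_a"]))
print(restore_path("dir_a/text.txt", "dir_b", preserved=["dir_a"]))
print(collides("dir_b/dir_a/notes.txt", "dir_b", ["dir_a"]))
```

The last call shows the ambiguity that has to be forbidden: a file originally at dir_b/dir_a/notes.txt would be stripped to dir_a/notes.txt, which is indistinguishable from a file preserved in place under dir_a.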


Original answer

In Git, files don't have history. There is nothing to preserve!

In Git, only commits have—or rather, are—history. Each commit is a complete snapshot of a source tree, plus some meta-data: a name and email and a timestamp (as the author of the commit), another name/email/timestamp triple (for the committer); a commit log message; and—crucial for forming history—the ID of a parent commit.

(Some commits, which we call merge commits, have two or more parents. At least one commit—namely the first ever made—has no parents; we call this a root commit. But most commits just have one parent, which is normally the commit that was the tip of some branch, just before the committer made a new commit that became the tip of that branch.)

It's by comparing a commit against its parent that we find out what happened over time. If the previous (parent) commit had 10 files, and the subsequent (child) commit had 11 files, then someone must have added a file. If the child commit has a new line 20 in README.txt, they must have added that line. But we only discover these dynamically, by comparing parent and child. That is the history, formed by the commits.
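As a rough sketch of that idea in Python (a toy model only: real commits store trees and hashes, and the Commit class and changes function here are invented for the example), a commit is a full snapshot plus a parent link, and "what changed" is computed on demand:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    snapshot: dict                     # path -> full file content
    author: str
    message: str
    parent: Optional["Commit"] = None  # None for a root commit


def changes(child):
    """Compare a commit with its parent; a root is compared with nothing."""
    before = child.parent.snapshot if child.parent else {}
    after = child.snapshot
    return {
        "added": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(
            p for p in after if p in before and after[p] != before[p]
        ),
    }


root = Commit({"README.txt": "hello\n"}, "A U Thor", "initial")
tip = Commit(
    {"README.txt": "hello\nworld\n", "main.py": "print(1)\n"},
    "A U Thor",
    "add main.py, extend README",
    parent=root,
)

commit = tip
while commit is not None:  # history is just a walk along parent links
    print(commit.message, changes(commit))
    commit = commit.parent
```

Nothing in either commit records "main.py was added"; that fact only appears when the two snapshots are compared.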

The git blame code will, as it works from child back to parent (and then treating that parent as another child of another parent), search for lines taken from other files, or for entire files renamed from one location to another. How well that search works is a separate matter—but as a general rule, if some file p/a/t/h.ext exists in the parent but not the child, and some other file n/e/w.name exists in the child but not the parent, Git will put these two files into a "candidates for rename detection" list.

If two differently named files are 100% bit-for-bit identical, Git will nearly always[1] pair them up. The less similar they become, the less likely Git is to pair them. This pairing has control knobs: in git diff and friends, it is the --find-renames (-M) similarity threshold, and there are also --find-copies and --find-copies-harder. In git blame, the -C argument controls things in a somewhat different way. I have not experimented enough with this to say for sure how it works, but based on the documentation, either one or two -C arguments should detect a whole-file rename.
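The pairing idea can be illustrated with a small Python sketch. It is not Git's algorithm: Git uses its own similarity measure, whereas this toy uses difflib from the standard library. The 50% threshold mirrors the documented default for -M, and the function name detect_renames is made up.

```python
from difflib import SequenceMatcher


def detect_renames(parent, child, threshold=0.5):
    """Pair files deleted from parent with files added in child.

    parent and child map path -> content. Returns (old, new, score)
    tuples, best matches first. The similarity score is difflib's ratio,
    standing in for Git's own measure.
    """
    deleted = [p for p in parent if p not in child]
    added = [p for p in child if p not in parent]
    scored = []
    for old in deleted:
        for new in added:
            score = SequenceMatcher(None, parent[old], child[new]).ratio()
            if score >= threshold:
                scored.append((score, old, new))
    scored.sort(key=lambda item: -item[0])  # most similar pairs first
    renames, used_old, used_new = [], set(), set()
    for score, old, new in scored:
        if old in used_old or new in used_new:
            continue  # each file takes part in at most one pairing
        used_old.add(old)
        used_new.add(new)
        renames.append((old, new, score))
    return renames


text = "line one\nline two\nline three\n"
parent = {"dir_a/text.txt": text, "README": "hello\n"}
child = {"dir_b/text.txt": text, "README": "hello\n"}
print(detect_renames(parent, child))
```

Because the moved file is byte-for-byte identical, it pairs with a score of 1.0. If the file were rewritten heavily in the same commit as the move, the score could fall below the threshold and the pair would be reported as an unrelated delete and add.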


[1] For git diff, rename-finding is completely disabled by default in Git versions before 2.9, but enabled by default in Git versions 2.9 and higher. In older versions of Git, you can set diff.renames to true to enable it without configuring a particular -M / --find-renames threshold.

There is also a maximum pairing-queue size, configurable as diff.renameLimit. Hitting that limit is rare, although renaming every file in a directory (which is how Git treats renaming a directory) is more likely to hit it. The default limit has grown over the years: it used to be 100, then 200, and was 400 files at the time of writing.
