We have two repositories that evolved in parallel: one for the code of our project, and one for the tests of this project. I would like to merge these two repositories into one repository, in such a way that, when I go back in history, I still have both directory structures.
Suppose that our current structure is the following, where `project` and `tests` are two separate git repositories:
project
/src
/include
tests
/short
/long
I would like to end up with one git repository that has two directories, `project` and `tests`.
I can't simply merge these two repositories using the techniques described in this answer, this one, or this site: they result in repositories that have two distinct histories before the merge, and when checking out a past commit, you have either `src` and `include`, or `short` and `long`, but you don't have all four of them as they appeared at that time.
If I check out a commit that was created in `project` 4 months ago, I would like to see `project/src` and `project/include` as they appeared in this commit, but I would also like to have `tests/short` and `tests/long` as they were at the same time in the (then separate) `tests` repository.
I understand that the ordering of the commits between both repositories will only depend on time, and may not be very precise. But that's good enough for me. And of course I know that I can't keep the original git ids from each repo. That's fine, because these two repos are actually fresh imports from another RCS, and so there is no git id that was ever recorded anywhere.
It should be doable to checkout one by one all the commits from each repo, ordered by time across repositories, and commit the resulting files. Is there already a tool that would do this?
I think you should combine the two repositories, creating 2 branches (`git fetch` without merge). Then interactively rebase one branch, stop at every commit, and do `git cherry-pick` of the corresponding commit into the current branch. Then continue the interactive rebase to the next commit (this saves the "edited" commit without modifications).
Perhaps that can even be automated. Instead of interactive rebase and manual cherry-picking you can probably use `git rebase --interactive -x`, executing `git cherry-pick` after every commit. The problem is how to find out which commit to cherry-pick. I think it should be `second-branch~count`. The count can be edited before the interactive rebase, while editing the rebase-todo file.
There is, it's named `git filter-branch`. By far the simplest to implement is to walk the `project` history and hunt up "the" corresponding `tests` commit's content, here's a sketch:
That will get slow if your "tests" history has many thousands of commits. If you're talking about the Linux repo or something on that scale, it would wind up cheaper to pregenerate a date-sorted tests list and step through that.
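A minimal sketch of that `git filter-branch` approach, under stated assumptions, might look like the following. The function name `merge_tests_by_date`, the `TESTS_REF` variable, and the `project/` and `tests/` prefixes are made up for this example. It relies on `git filter-branch` exporting `GIT_COMMIT` and `GIT_COMMITTER_DATE` to the filter, and it picks the newest tests commit whose committer date is not later than the project commit being rewritten.

```shell
# Hypothetical helper: rewrite the given project branch(es) so that every
# commit holds the project files under project/ and, under tests/, the
# newest tests commit that is not later than the project commit.
# Usage: merge_tests_by_date <tests-ref> <branch-to-rewrite>...
merge_tests_by_date() {
    TESTS_REF=$1
    shift
    export TESTS_REF    # made visible to the filter below
    git filter-branch --index-filter '
        # Start from an empty index, then re-read this commit under project/.
        git read-tree --empty &&
        git read-tree --prefix=project/ "$GIT_COMMIT" &&
        # Newest tests commit with committer date <= this commit (may be none).
        t=$(git rev-list -1 --before="$GIT_COMMITTER_DATE" "$TESTS_REF") &&
        if test -n "$t"; then git read-tree --prefix=tests/ "$t"; fi
    ' -- "$@"
}
```

It would be called as, say, `merge_tests_by_date tests/master master` in a repository where both histories have already been fetched; the ref names here are placeholders. The `git rev-list` lookup runs once per project commit, which is the cost mentioned above.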
Edit: for a date-based approach that makes this pretty easy but assumes one of the two repositories is going to be "in control" of which commits come from the other repository, see jthill's answer. You end up with a commit history that exactly matches the "project" history, possibly squashing some of the "tests" history. The answer below is more appropriate if you need to add a prefix to both sets of histories, or want to interleave them (e.g., need two different "tests" updates for the same "project" commit).
phd's answer is fine, but if I were doing this myself and wanted to make it really neat and clean, I would use a different approach.
If the trees for the two repositories don't overlap, it's certainly possible to do this, and by bypassing the usual Git mechanisms and going straight to the underlying `git read-tree` commands, you can automate it. (This is where VonC's recent comment rejecting my claim that Git and Mercurial are very much alike is true: if you bypass the top level Git commands, you get something you cannot get nearly as easily in Mercurial.)
Just as in phd's answer, you would start this process by combining the two repository commit databases via `git fetch`. (You can do this in a third repo, which I'd recommend since it makes it easier to restart the process from scratch if you decide you want to tweak some parameters, or by adding either repo A to repo B, or repo B to repo A.) But after that, everything diverges.
You now have two disjoint commit DAGs:
(If repoA and repoB both have more than one branch tip, draw whatever simplified diagram of their commits is more appropriate.)
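Getting to this starting point with a third repository might look like the sketch below. The function name `fetch_both` and its arguments are hypothetical; the remote names `A` and `B` are chosen to match the `A/master` and `B/master` references used later in this answer.

```shell
# Hypothetical helper: create a fresh repository and fetch both histories
# into it as remotes A and B, without merging anything.
# Usage: fetch_both <new-repo-dir> <path-or-url-of-repo-A> <path-or-url-of-repo-B>
fetch_both() {
    combined=$1 repo_a=$2 repo_b=$3
    git init -q "$combined" &&
    git -C "$combined" remote add A "$repo_a" &&
    git -C "$combined" remote add B "$repo_b" &&
    git -C "$combined" fetch -q A &&
    git -C "$combined" fetch -q B
}
```

After this, the new repository has no branch of its own yet, only the remote-tracking references of the two unrelated histories. Throwing the directory away and re-running the function restarts the process from scratch.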
Your next step is to enumerate all the commits in each of the two disjoint DAGs, using `git rev-list --topo-order --reverse` and whatever other sorting options you like. When and whether `--topo-order` is required depends on the topology and other sorting information, but in general you will want a parent commit listed before any of its children.
Given these two linearized lists of commit hash IDs, you now have the hard part: constructing the graph of new, combined trees you wish to commit. Every new commit will be made by combining one commit from each of the two old graphs. If one of the graphs is complex (as for repoA above) with branches and merges, and one isn't (as for repoB above), this can be particularly tricky.
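For two simple linear histories, one possible way to get a single time-ordered list is sketched below. The function name `list_by_time` and the `A`/`B` tags are assumptions for illustration. It uses the `--timestamp` option of `git rev-list` to print each commit's raw timestamp before its hash, then merges the two lists with a stable numeric sort.

```shell
# Hypothetical helper: print "<timestamp> <hash> <A|B>" for every commit
# reachable from the two given refs, oldest first.
# Usage: list_by_time <ref-in-history-A> <ref-in-history-B>
list_by_time() {
    {
        git rev-list --topo-order --reverse --timestamp "$1" | sed 's/$/ A/'
        git rev-list --topo-order --reverse --timestamp "$2" | sed 's/$/ B/'
    } | sort -s -n -k 1,1   # -s keeps each history's own order on equal timestamps
}
```

Note that sorting by timestamp only preserves the parent-before-child order when commit timestamps never go backwards within a history. If they do, the sort key needs more thought than this sketch gives it.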
I've made my own setup for this, where I have a very simple graph:
In my simplified setup, I'd like to make my first commit on my new master be commit `C` that combines the trees of `A` and `O`:
Then I'd like to make, as my second commit on `master`, the combination of `A` and `P` (not `A` and `O` and not `B` and `O` either), and as my last commit, the combination of `B` and `P`, so that I end up with:
So, here we are in a new empty repository, except that we've read in projects A and B:
(I accidentally didn't hyphenate commit O, but did hyphenate all the others. The `sed` is to remove some blank lines that don't really help reading, in this case.)
Now we build the new commits, one at a time, using `git read-tree` to populate the index to make the commits. We start with an empty index (which we have right now):
We want our first commit to combine `A` and `O`, so let's read those two commits into the index now. If we had to add a prefix to the tree in `A` we could do that here:
We can make the commit we need now:
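One round of this cycle (empty the index, read two commits, commit) could be wrapped in a small shell function along these lines. The name `combine_commit` is hypothetical, and the `project/` and `tests/` prefixes are an assumption taken from the layout the question asks for, not from the walkthrough, which reads both trees without a directory prefix.

```shell
# Hypothetical helper: build one combined commit on the current branch from
# one commit of each history, working only in the index.
# Usage: combine_commit <commit-from-A> <commit-from-B> <message>
combine_commit() {
    git read-tree --empty &&                    # start from an empty index
    git read-tree --prefix=project/ "$1" &&     # tree of the first commit
    git read-tree --prefix=tests/ "$2" &&       # tree of the second commit
    git commit -q -m "$3"                       # commit whatever the index holds
}
```

A call such as `combine_commit A/master B/master 'combine B and P'` would produce the last of the three commits. Because `git commit` records the index, the empty work-tree does not get in the way.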
Now we need to make the next commit, which means we need to build up the correct tree in the index. To do that we first have to clean it out; otherwise the next `git read-tree --prefix` will fail with a complaint about overlapping files and `Cannot bind.`
So now we empty the index, then read commits A and P:
If you like, you can examine the result using `git ls-files --stage` again:
In any case they can now be committed as the new commit:
(you can see now how I end up with inconsistent hyphenation :-) ). Last, we repeat the process by emptying the index, reading in the two desired commits (B+P), and committing the result:
(I used symbolic names here to get the last two commits, but hash IDs from `git rev-list` would of course work well.) We can now see the three commits, all on `master`:
It's now safe to delete the `A/master` and `B/master` references (and the two remotes). There's one peculiarity: since we did all the work directly in the index, without bothering with a work-tree, the work-tree is still completely empty:
To fix that at the end, we should just run `git checkout HEAD -- .`:
How to write your own automation script
In practice, you will probably want to use `git write-tree` and `git commit-tree`, rather than `git commit`, to make the new commits. You would write a little script (in whatever language you like) to run `git rev-list` to collect the hash IDs of commits to combine. The script must inspect those commits (e.g., by looking at authorship and dates, or file contents, or whatever) to decide how to interweave the commits. Then, having made the decisions about interweaving and what branch-and-merge structures to provide, the script can begin the process of repeatedly doing these steps:
1. Empty the index.
2. Read in the first commit with `git read-tree` (use whatever `--prefix` option is appropriate; in your case this is `--prefix=`, i.e., the empty string, but in other cases it would be a directory name with a trailing slash).
3. Read in the other commit with `git read-tree` and the appropriate `--prefix`, so that there are no collisions between entries from `A` and `B`.
4. Use `git write-tree` to write the tree. Its output is the tree hash ID for the next step.
5. Use `git commit-tree` with appropriate `-p` argument(s) to set the parent(s) of the new commit. Feed it the appropriate (combined or whatever) commit message text. Use the environment variables `GIT_AUTHOR_NAME`, `GIT_AUTHOR_EMAIL`, `GIT_AUTHOR_DATE`, `GIT_COMMITTER_NAME`, `GIT_COMMITTER_EMAIL`, and `GIT_COMMITTER_DATE` to control the author and committer names and dates. The output from `git commit-tree` is the new commit's hash ID, which becomes the parent of some subsequent commit.
When the whole thing finishes, the last commits made for any particular branch or set of branches are the hash IDs that go into those branches, so you can now run:
for each such hash ID.
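Putting these steps together for the simplest case, two linear histories interleaved purely by commit time, might look like the sketch below. Everything named here is an assumption for illustration: the function names `make_combined`, `interleave_history`, and `combine_by_time`, the `project/` and `tests/` prefixes, and the policy itself (one new commit per old commit once both histories have contributed at least one, with message, author, and committer copied from the old commit that triggered it). Branchy histories need a real decision procedure in place of the `sort`.

```shell
# Hypothetical: write one combined commit and print its hash.
# Usage: make_combined <commit-A> <commit-B> <metadata-source-commit> [<parent>]
make_combined() {
    a=$1 b=$2 src=$3 parent=$4
    git read-tree --empty &&
    git read-tree --prefix=project/ "$a" &&
    git read-tree --prefix=tests/ "$b" || return 1
    tree=$(git write-tree) || return 1
    # Reuse message, identities, and dates of the commit that caused this step.
    git log -1 --format=%B "$src" |
    GIT_AUTHOR_NAME=$(git log -1 --format=%an "$src") \
    GIT_AUTHOR_EMAIL=$(git log -1 --format=%ae "$src") \
    GIT_AUTHOR_DATE=$(git log -1 --date=raw --format=@%ad "$src") \
    GIT_COMMITTER_NAME=$(git log -1 --format=%cn "$src") \
    GIT_COMMITTER_EMAIL=$(git log -1 --format=%ce "$src") \
    GIT_COMMITTER_DATE=$(git log -1 --date=raw --format=@%cd "$src") \
    git commit-tree "$tree" ${parent:+-p "$parent"}
}

# Hypothetical: read "<timestamp> <hash> <A|B>" lines on stdin, chain the
# combined commits, and print the hash of the last one.
interleave_history() {
    cur_a= cur_b= last=
    while read -r _ts hash side; do
        case $side in
            A) cur_a=$hash ;;
            B) cur_b=$hash ;;
        esac
        # Wait until each history has contributed at least one commit.
        [ -n "$cur_a" ] && [ -n "$cur_b" ] || continue
        last=$(make_combined "$cur_a" "$cur_b" "$hash" "$last") || return 1
    done
    printf '%s\n' "$last"
}

# Hypothetical driver. Usage: combine_by_time <ref-A> <ref-B> <new-branch-name>
combine_by_time() {
    tip=$(
        {
            git rev-list --topo-order --reverse --timestamp "$1" | sed 's/$/ A/'
            git rev-list --topo-order --reverse --timestamp "$2" | sed 's/$/ B/'
        } | sort -s -n -k 1,1 | interleave_history
    ) && [ -n "$tip" ] && git update-ref "refs/heads/$3" "$tip"
}
```

With the A, O, P, B ordering from the walkthrough above, `combine_by_time A/master B/master combined` yields the same three commits (A+O, A+P, B+P) on a branch named `combined`. The functions overwrite the repository's index, so they are best run in a scratch repository such as the third repo recommended earlier.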