I have a git repository (covering more or less project history) and separate sources (just a tarball with few files) which have forked some time ago (actually somewhere in 2004 or 2005).
The sources from tarball have undergone quite a lot of changes from which I'd like to incorporate some. Now the question is - how to find out what was actually the branch point for the changed sources to get minimal diff of what has happened there.
So what I basically want is to find place in git history, where the code is most similar to the tarball of sources I have. And I don't want to do that manually.
It is also worth mentioning that the changed sources include only subset of files and have split some files into more. However the code which is in there seem to get only small modifications and several additions.
If you want to play with that yourself, the tarball with sources is here and Git is hosted at Gitorious: git://gitorious.org/gammu/mainline.git
If you have a rough idea as to where the fork occurred, consider using Will Manley's
git meld
. (See also: View differences of branches with meld?.)To do this, add the tarball contents to your repository (which you'll be doing anyway). After installing Meld and
git-meld
, runon different commits until you find the one with the least differences. This command will open
meld
and view the changes in the directory tree between the specified commits, with identical files hidden. Example screenshots :Meld showing two very different commits:
Showing two similar commits:
Import that files in the tarball into a git revision, on a separate branch or a completely new one: the position in the revision graph isn't important, we just want it available as a tree.
Now for each revision in master, just diff against that tree/revision ('imported') and just output how big the diff is. Something like:
So the revision with the smallest patch size will be the "closest", by a very rough rule of thumb. (An identical revision will produce a patch size of 0, and anything else will certainly be non-zero, and the more that's changed, the bigger).
Not a great solution, but to get a guess of which revisions it might be: Assume that some of the files in the tar ball have not been changed since they were branched. Run git hash object against each file in the tar ball, then search for those files in the repository using git show. Then try and find the commits under which these files were included, possibly using git whatchanged. The answer to your question might then be the commit with the most common files, but it'll still be a bit hit and miss.
how was the fork made? was it a clone that someone else made and then did their own work? if so, then this is really easy. all you need to do is create a local branch that pulls in the code from the fork. git will see the ancestry of the forked branch pointing to one of the commits from your original repository and will "connect the dots" so to speak... it will reconnect the history from your original repository to the fork.
you should be able to do this:
at this point, you can run
gitk
and see the complete history of the forked branch and your local repository, and see if they connect or not.based on what araqnid said I came up with 9c6c864426bf88429e77c7e22b5aa78e9295b97a (just asked for stuff between 0.61.0 and HEAD) this is probably not the best) you might do better with something like
assuming you've imported the tarball into git and have that revision checked out (I did this by untaring and then
So after you do that and the run the above it should output the size of all the diffs in ascending order of patchsize (the first one will be 0 since it'll find the current head) it'll take a long time... but it should find the smallest diff...
In the general case, you'd actually have to examine every single commit, because you have no way of knowing if you might have a huge diff in one, small diff the next, then another huge diff, then a medium diff...
Your best bet is probably going to be to limit yourself to specific files. If you consider just a single file, it should not take long to iterate through all the versions of that file (use
git rev-list <path>
to get a list, so you don't have to test every commit). For each commit which modified the file, you can check the size of the diff, and fairly quickly find a minimum. Do this for a handful of files, hopefully they'll agree!The best way to set yourself up for the diffing is to make a temporary commit by simply copying in your tarball, so you can have a branch called
tarball
to compare against. That way, you could do this:to get a nice list of all the commits with their diff sizes (the first three columns will be SHA1, number of lines added, and number of lines removed). Then you could just pipe it on into
awk '{print $1,$2+$3}' | sort -n -k 2
, and you'd have a sorted list of commits and their diff sizes!If you can't limit yourself to a small handful of files to test, I might be tempted to hand-implement something similar to
git-bisect
- just try to narrow your way down to a small diff, making the assumption that in all likelihood, commits near to your best case will also have smaller diffs, and commits far from it will have larger diffs. (Somewhere between Newton's method and a full on binary/grid search, probably?)Edit: Another possibility, suggested in Douglas' answer, if you think that some files might be identical to those in some commit, is to hash them using
git-hash-object
, and then see what commits in your history have that blob. There's a question with some excellent answers about how to do that. If you do this with a handful of files - preferably ones which have changed frequently - you might be able to narrow down the target commit pretty quickly.