I understand how Git can support file moves: since it identifies files by content hash, an "added" file is easily detected as being the same as the "removed" one.
My question is about refactoring: in Java, for example, the package declaration changes, so the file content will NOT be the same. In that case, how does Git determine that the "added" file shares history with the "removed" one? Does it look for the "most similar content", assuming I only made minor changes, or does it use some similar heuristic?
As mentioned in the Git FAQ, it detects similar content based on a heuristic.
Git has to interoperate with a lot of different workflows, for example some changes can come from patches, where rename information may not be available. Relying on explicit rename tracking makes it impossible to merge two trees that have done exactly the same thing, except one did it as a patch (create/delete) and one did it using some other heuristic.
On a second note, tracking renames is really just a special case of tracking how content moves in the tree. In some cases, you may instead be interested in querying when a function was added or moved to a different file. By only relying on the ability to recreate this information when needed, Git aims to provide a more flexible way to track how your tree is changing.
However, this does not mean that Git has no support for renames.
The diff machinery in Git has support for automatically detecting renames; this is turned on by the '-M' switch to the git-diff-* family of commands.
The rename detection machinery is used by git-log(1) and git-whatchanged(1), so for example, 'git log -M' will give the commit history with rename information.
Git also supports a limited form of merging across renames.
The two tools for assigning blame, git-blame(1) and git-annotate(1), both use the automatic rename detection code to track renames.
The git log man page gives you some details about that heuristic:
-B[<n>][/<m>]
Break complete rewrite changes into pairs of delete and create. This serves two purposes:
It affects how a change that amounts to a total rewrite of a file is shown: not as a series of deletions and insertions mixed together with a very few lines that happen to match textually as context, but as a single deletion of everything old followed by a single insertion of everything new. The number m controls this aspect of the -B option (defaults to 60%).
-B/70% specifies that less than 30% of the original should remain in the result for git to consider it a total rewrite (i.e. otherwise the resulting patch will be a series of deletion and insertion mixed together with context lines).
When used with -M, a totally-rewritten file is also considered as the source of a rename (usually -M only considers a file that disappeared as the source of a rename), and the number n controls this aspect of the -B option (defaults to 50%).
-B20% specifies that a change with addition and deletion compared to 20% or more of the file's size are eligible for being picked up as a possible source of a rename to another file.
-M[<n>]
If generating diffs, detect and report renames for each commit. For following files across renames while traversing history, see --follow.
If n is specified, it is a threshold on the similarity index (i.e. amount of addition/deletions compared to the file's size).
For example, -M90% means git should consider a delete/add pair to be a rename if more than 90% of the file hasn't changed.
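To make the threshold idea concrete for the Java refactoring case from the question, here is a minimal Python sketch. It is not Git's actual algorithm: Git's similarity index is computed by its own diff machinery, while this sketch swaps in a line-based ratio from Python's difflib. The function names, the sample file paths, and the greedy best-match pairing are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(old_text: str, new_text: str) -> float:
    """Line-based similarity in percent (a stand-in for Git's similarity index)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if not old_lines and not new_lines:
        return 100.0  # two empty files are treated as identical
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return 100.0 * matcher.ratio()


def detect_renames(deleted, added, threshold=50.0):
    """Pair deleted and added files whose similarity reaches the threshold.

    deleted / added: dicts mapping path -> file content.
    Returns a list of (old_path, new_path, score), best matches first.
    The greedy pairing below is an illustrative assumption.
    """
    candidates = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = similarity_percent(old_text, new_text)
            if score >= threshold:
                candidates.append((score, old_path, new_path))

    # Highest score first; paths break ties so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    renames, used_old, used_new = [], set(), set()
    for score, old_path, new_path in candidates:
        if old_path in used_old or new_path in used_new:
            continue  # each file takes part in at most one rename
        used_old.add(old_path)
        used_new.add(new_path)
        renames.append((old_path, new_path, score))
    return renames


# Hypothetical refactoring: only the package declaration changes.
OLD = "package com.example.old;\n\npublic class Foo {\n    void run() {}\n}\n"
NEW = "package com.example.refactored;\n\npublic class Foo {\n    void run() {}\n}\n"

deleted = {"src/com/example/old/Foo.java": OLD}
added = {"src/com/example/refactored/Foo.java": NEW}

print(detect_renames(deleted, added, threshold=50.0))  # pair is reported
print(detect_renames(deleted, added, threshold=90.0))  # nothing is reported
```

In this tiny file, four of five lines are unchanged, so the pair passes a 50% threshold but not a 90% one. That mirrors the behaviour described above: for a very small file, a changed package line can be enough to drop below a strict threshold, while in a realistically sized class the same change would barely affect the score.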
Additional references:
- Linus's ultimate content tracking tool blog post, by Junio C Hamano, maintainer of Git.
- Getting Git to Acknowledge Previously Moved Files
- How to make git mark a deleted and a new file as a file move?
- How does Git solve the merging problem?
Note: With Git 2.18 (Q2 2018), git status should now show you renames (instead of delete/add files) when you move/rename files.
See "How to tell Git that it's the same directory, just a different name".