`git` shows changed files after cloning, without a

Posted 2019-05-24 14:13

Question:

Running git clone git@github.com:erocarrera/pydot (35a8d858b) on Debian, with git config core.autocrlf input, leaves git status showing:

modified:   test/graphs/b545.dot
modified:   test/graphs/b993.dot
modified:   test/graphs/cairo.dot

These files have CRLF line endings, for example:

$ file test/graphs/cairo.dot
test/graphs/cairo.dot: UTF-8 Unicode text, with CRLF line terminators

The .gitattributes file contains:

*.py eol=lf
*.dot eol=lf
*.txt eol=lf
*.md eol=lf
*.yml eol=lf

*.png binary
*.ps binary

Changing core.autocrlf has no effect on the status of these files. Deleting the .gitattributes has no effect either. Changing these files with dos2unix does not change their status (as expected), and back with unix2dos shows no difference with diff versus an older copy. File permissions look unchanged with ls -lsa. Also, the files have uniform line endings as far as I can tell with vi -b (thus it shouldn't be the case that unix2dos or dos2unix convert from mixed to uniform line endings, which could have explained this strange behavior). I'm using git version 2.11.0.

What does git think has changed?

Somewhat relevant:

  1. Git status shows files as changed even though contents are the same
  2. Files showing as modified directly after git clone
  3. Cloning a git repo, and it already has a dirty working directory... Whaaaaa?

I searched several discussions but didn't find an answer that explains this behavior. This issue arose from pydot #163.

In more detail:

git status

On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   test/graphs/b545.dot
    modified:   test/graphs/b993.dot
    modified:   test/graphs/cairo.dot

no changes added to commit (use "git add" and/or "git commit -a")

git diff test/graphs/b993.dot

warning: CRLF will be replaced by LF in test/graphs/b993.dot.
The file will have its original line endings in your working directory.
diff --git a/test/graphs/b993.dot b/test/graphs/b993.dot
index e87e112..8aa0872 100644
--- a/test/graphs/b993.dot
+++ b/test/graphs/b993.dot
@@ -1,10 +1,10 @@
-diGraph G{
-graph [charset="utf8"]
-1[label="Umlaut"];
-2[label="ü"];
-3[label="ä"];
-4[label="ö"];
-1->2;
-1->3;
-1->4;
-}
+diGraph G{
+graph [charset="utf8"]
+1[label="Umlaut"];
+2[label="ü"];
+3[label="ä"];
+4[label="ö"];
+1->2;
+1->3;
+1->4;
+}

UPDATE:

Out of curiosity, I committed one of these files, dumped git log -1 -p > diff, and vi -b diff shows that git normalized the line endings (CRLF to LF):

commit 2021d6adc1bc8978fa08d729b3f4d565f9b89651
Author:
Date:

    DRAFT: experiment to see what changed

diff --git a/test/graphs/b545.dot b/test/graphs/b545.dot
index ebd3e8f..2c33f91 100644
--- a/test/graphs/b545.dot
+++ b/test/graphs/b545.dot
@@ -1,9 +1,9 @@
-digraph g {^M
-^M
-"N11" ^M
-  [^M
-  shape = record^M
-  label = "<p0>WFSt|1571       as Ref: 1338    D"^M
-]^M
-N11ne -> N11:p0^M
-}^M
+digraph g {
+
+"N11" 
+  [
+  shape = record
+  label = "<p0>WFSt|1571       as Ref: 1338    D"
+]
+N11ne -> N11:p0
+}

Other weird observations: running git checkout on any of these files after cloning has no effect. After the above commit, the file b545.dot continued to have CRLF line endings in the working directory. Applying dos2unix followed by unix2dos didn't make git think that it had changed (whereas before the commit it did, probably because the committed file had CRLF line endings).

Answer 1:

This happens precisely because those files are committed with CRLF endings, yet the .gitattributes file says to commit them with LF-only endings.

Git can and will do CRLF-vs-LF-only conversion in two places:

  • During extraction from index to work-tree. A file stored in a commit or in the index is always assumed to be in a "clean" state, but when extracting that file from the index, to the work-tree, Git should apply any conversions directed by .gitattributes in the form of "change LF-only to CRLF", for instance, and also in the form of what Git calls smudge filters.

  • During the copy of a file from work-tree back to index. A file stored in the work-tree is in the "smudged" state, so at this point, Git should apply any "cleaning" conversions: for instance, changing CR-LF to LF-only, and applying clean filters.

Note that there are two points at which these conversions can occur. This does not mean that they will occur at both points, just that these are the two possible places. As the .gitattributes documentation notes, the actual conversions are:

  • eol=lf: none on index -> work-tree; CR-LF to LF-only on work-tree -> index
  • eol=crlf: LF-only to CR-LF on index -> work-tree; CR-LF to LF-only on work-tree -> index (eol implies text, so check-in still normalizes)
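
As a rough illustration of the eol=lf case (the one that applies to the .dot files here), the two directions can be modelled as plain byte transformations. This is a simplified sketch, not Git's actual code: the function names are made up for the example, and real Git also handles filters, encodings, and safety checks.

```python
def to_worktree_eol_lf(blob: bytes) -> bytes:
    """Index -> work-tree under eol=lf: the bytes are copied unchanged."""
    return blob


def to_index_eol_lf(worktree: bytes) -> bytes:
    """Work-tree -> index under eol=lf: each CR-LF pair becomes a bare LF."""
    return worktree.replace(b"\r\n", b"\n")


committed = b"digraph g {\r\n}\r\n"          # a blob that was stored with CR-LF endings
checked_out = to_worktree_eol_lf(committed)   # what lands in the work-tree
would_add = to_index_eol_lf(checked_out)      # what `git add` would store

print(checked_out == committed)  # True: checkout leaves the CR-LF endings in place
print(would_add)                 # b'digraph g {\n}\n'
```

The asymmetry is the point: checkout does nothing, while check-in rewrites, so a CR-LF blob does not survive a round trip.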

Now, a file that's actually in the repository, stored in a commit, is purely read-only. It can never change inside that commit. More precisely, the commit identifies (by hash ID) a tree that identifies (by hash ID) a blob that has whatever contents it has. These hash IDs are themselves cryptographic checksums of the object contents, so they are naturally all read-only: if we try to change the contents, what we get is instead a new, different object with a new, different hash ID.
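
To make the "different contents, different hash ID" point concrete, here is a small sketch that computes a blob ID the way a default (SHA-1) repository does: SHA-1 over a "blob <size>" header, a NUL byte, and the raw contents. The helper name git_blob_id and the sample contents are assumptions for illustration; repositories using SHA-256 object names would produce different IDs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Blob ID in a SHA-1 repository: sha1(b'blob <size>' + NUL + content)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


crlf_version = b"digraph g {\r\n}\r\n"
lf_version = b"digraph g {\n}\n"

print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(git_blob_id(crlf_version) != git_blob_id(lf_version))  # True
```

Because the carriage returns are part of the hashed bytes, the CR-LF and LF-only versions of the same text are two distinct objects as far as Git is concerned.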

Because git checkout actually works by copying the raw hash IDs from the commit's tree(s) to the index, the versions of files stored in the index are necessarily identical to those stored in the commit.

Hence, if somehow—regardless of the how—the committed files are in a form that disagrees with what .gitattributes directs Git to do, the files will become "dirty" in the work-tree regardless of the fact that you haven't done anything to them! If you were to git add the three files in question, that would copy them from work-tree to index, and hence delete the carriage-returns from their line endings. Hence they are, in git status terms, modified but not yet staged for commit.

Stripping the carriage returns from the work-tree versions leaves them in the same state: they are still modified with respect to what's in the index, because git add would now store their LF-only contents unchanged, producing files that differ from the CR-LF versions currently in the index.
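
The comparison behind "modified" can be sketched as follows, assuming eol=lf is in effect. The function names and sample contents are hypothetical, and real Git also consults cached stat information before comparing contents, which this sketch ignores.

```python
def clean_eol_lf(worktree: bytes) -> bytes:
    """What `git add` would store under eol=lf: CR-LF pairs become LF."""
    return worktree.replace(b"\r\n", b"\n")


def shows_as_modified(index_blob: bytes, worktree: bytes) -> bool:
    """A file counts as modified when its cleaned form differs from the index blob."""
    return clean_eol_lf(worktree) != index_blob


crlf_blob = b"1->2;\r\n1->3;\r\n"   # contents with CR-LF endings
lf_blob = b"1->2;\n1->3;\n"         # the same text after dos2unix

print(shows_as_modified(crlf_blob, crlf_blob))  # True: fresh clone, CR-LF in index and work-tree
print(shows_as_modified(crlf_blob, lf_blob))    # True: dos2unix does not help
print(shows_as_modified(lf_blob, crlf_blob))    # False: LF in index, CR-LF in work-tree
```

The last case matches what the question reports after the experimental commit: once the index holds the LF-only blob, neither line-ending style in the work-tree registers as a change.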

A more interesting question is: How did they get into the commit(s) in the wrong state? This is not something we can answer: only those who made those commits can produce that answer. We can only speculate. One way to achieve this is to add and commit the files without a .gitattributes in effect, then to set the .gitattributes into effect without git add-ing the files again. This way, the CR-LF endings get into someone's index and hence get into that user's commits, even though the .gitattributes file now says (but did not earlier say) that any new git add should strip away the carriage returns.



Answer 2:

Changing core.autocrlf has no effect on the status of these files

It should, but only after cloning again:

git config --global core.autocrlf false

git clone git@github.com:erocarrera/pydot pydot2
cd pydot2
git status

That would deactivate core.autocrlf globally, but this is just for testing here.



Answer 3:

Thanks to @torek for the explanation (which agrees with my conjecture).

In summary, the asymmetric git configuration leads to commit(checkout(Index)) not being the identity mapping. With CRLF in the index, this particular configuration checked out CRLF, but after the input transformations in effect (eol=lf), git would commit LF instead of CRLF.

The root cause of this confusion was comparing the:

  • file I see in the working directory, with the
  • committed file.

This doesn't show whether the file has changed. What one should compare is what git will commit after applying the input transformations with what is already committed. Clearly, if those two items differ, then the file has changed.

Following this reasoning, one could declare the repository "unstable", in that it regards itself as modified in absence of interaction with the world. This supports avoiding this state by changing the committed files to LF, or changing the .gitattributes (I prefer committing LF).
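
One way to picture the "unstable" state is as a scan for blobs that are not fixed points of the checkout-then-commit round trip. The sketch below is an assumption-laden model, not a Git feature: the tree is a plain dict with made-up contents (test/graphs/ok.dot and logo.png are hypothetical paths), and fnmatch on the basename only approximates how Git matches the slash-free patterns from the .gitattributes shown in the question.

```python
from fnmatch import fnmatch
from posixpath import basename
from typing import Dict, List

# Patterns marked eol=lf in the .gitattributes from the question.
EOL_LF_PATTERNS = ["*.py", "*.dot", "*.txt", "*.md", "*.yml"]


def is_eol_lf(path: str) -> bool:
    """Approximate attribute lookup: match the file name against each pattern."""
    return any(fnmatch(basename(path), pattern) for pattern in EOL_LF_PATTERNS)


def round_trip(blob: bytes) -> bytes:
    """Checkout copies the bytes unchanged under eol=lf; check-in then strips CR before LF."""
    return blob.replace(b"\r\n", b"\n")


def unstable_paths(tree: Dict[str, bytes]) -> List[str]:
    """Paths whose committed blob would change if checked out and re-added."""
    return sorted(
        path for path, blob in tree.items()
        if is_eol_lf(path) and round_trip(blob) != blob
    )


tree = {  # hypothetical blob contents
    "test/graphs/cairo.dot": b"digraph g {\r\n}\r\n",
    "test/graphs/ok.dot": b"digraph g {\n}\n",
    "logo.png": b"\x89PNG\r\n\x1a\n",
}
print(unstable_paths(tree))  # ['test/graphs/cairo.dot']
```

Committing LF versions of the reported paths makes the list empty, which is the stable state argued for above. The PNG entry is skipped because it matches no eol=lf pattern, even though its header contains a CR-LF pair.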

In this situation, git would commit LF for both LF and CRLF in the working directory, so dos2unix and unix2dos would have no effect on the commit outcome, and thus none on the file's status.