Git messed up my files, showing Chinese characters

Published 2020-05-18 16:47

Question:

Disclaimer: by 'Git', I mean 'I' messed up.

Earlier, I wanted git-gui to show me diffs for files it thinks are binary.

So I made some changes to my .\.gitattributes

*.ini       text
*.inc       text

But it didn't work. Then I made some changes to my .\.git\info\attributes

*.ini       text
*.inc       text
*.inc crlf diff
*.ini crlf diff

and it worked.

But now when I go back to previous commits it messes up...

This is how it should look (screenshot not reproduced here):

It doesn't happen in all the files. EDIT: It happens only in files that have any special characters in them.

Q: Is the issue with the commits themselves or just some setting?
Q: Can I recover?

Answer 1:

Your ini files are saved in UTF-16LE, the encoding that Windows misleadingly describes as ‘Unicode’.

Git's default diffing tools don't work on UTF-16, because it's not an ASCII-compatible encoding. This is why Git detected the files as binary originally.

LF/CRLF newline conversion is seeing each 0x0A byte as being a newline, and replacing it with 0x0D-0x0A. But, in a UTF-16LE file, a newline is actually signalled by 0x0A-0x00, and replacing that with 0x0D-0x0A-0x00 means that you've got an odd number of bytes, so the alignment of each two-byte code unit in the next line is out of sync. Consequently every other line gets mangled.
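The mangling described above can be reproduced in a few lines of Python (a sketch, not part of the original answer): apply a byte-level LF-to-CRLF conversion to UTF-16LE text and decode the result.

```python
# Reproduce the mangling: a byte-level LF -> CRLF conversion applied to
# UTF-16LE text shifts the two-byte code-unit alignment, so later bytes
# decode as CJK ("Chinese") characters.
raw = "a=1\nb=2\n".encode("utf-16-le")       # newline is the byte pair 0A 00
mangled = raw.replace(b"\x0a", b"\x0d\x0a")  # what CRLF conversion does
print(mangled.decode("utf-16-le"))           # 'a=1' followed by CJK garbage
```

The first line survives, but everything after the first converted newline is out of alignment, which matches the "every other line gets mangled" symptom.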

Your options are:

  1. Revert the attribute change and let Git handle the files as binary (losing the benefit of diffs).

  2. Save the files in an ASCII-compatible encoding. It looks like your content doesn't actually have any non-ASCII characters in it, so hopefully that's not a problem. Normally you would want to save all your files as UTF-8 - this is ASCII-compatible but also allows all Unicode characters to be used. But that depends on whether Rainmeter supports reading INI files encoded like that (probably not).

  3. Configure git to use a different diff tool, though this will make it more complicated for others to work with your repo.
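For option 3, one standard mechanism is a textconv diff driver, which converts the file to UTF-8 just for diffing while leaving the stored bytes alone. A sketch (the driver name utf16 is arbitrary, and this assumes iconv is available - it ships with Git for Windows):

```
# .gitattributes - route the files through a custom diff driver
*.ini diff=utf16
*.inc diff=utf16

# .git/config (or ~/.gitconfig) - convert to UTF-8 for display only
[diff "utf16"]
    textconv = iconv -f utf-16le -t utf-8
```

Note the caveat from the answer still applies: anyone else cloning the repo needs the same config for diffs to render.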



Answer 2:

I had a similar problem recently. We have a project-wide .gitattributes file at the root level, which includes the lines:

* text=auto
*.sql     text

One of our team was writing SQL code using SQL Management Studio which, unknown to him, was saving the files as UTF-16. He was able to check in the code to Git without problem, but on check-out the code was translated to the Chinese characters described in this post.

A hexdump of the files in question confirmed the issue was indeed the conversion of the UTF-16LE newline bytes 0x0A 0x00 into 0x0D 0x0A 0x00.

For us the solution was to convert the files to ASCII using the following:

  1. Delete the offending file from the working directory.
  2. Create a temporary .gitattributes file in the local directory to force Git to check out the file without performing line-ending conversion, e.g. include the line *.sql binary

  3. Check out the file(s) from Git. You should see that the files have not been translated and have no Chinese characters.

  4. Convert the file to ASCII. We used Notepad++ for this, but it's also possible to use iconv, which is installed as part of Git for Windows. I think UTF-8 would also be an option if the file contains non-ASCII characters - but this was not necessary for our purposes.
  5. Check in the ASCII version of the file.
  6. Delete the local .gitattributes file.
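Step 4 can also be sketched in a few lines of Python, as an alternative to Notepad++ or iconv. The file name script.sql is hypothetical, and the first two lines are demo setup standing in for a file that an editor saved as UTF-16:

```python
from pathlib import Path

# Demo setup: pretend an editor saved this file as UTF-16 (with a BOM).
src = Path("script.sql")  # hypothetical file name
src.write_bytes("SELECT 1;\n".encode("utf-16"))

# Step 4: decode UTF-16 and re-encode as ASCII.
text = src.read_bytes().decode("utf-16")  # honours the BOM if present
src.write_bytes(text.encode("ascii"))     # raises UnicodeEncodeError on non-ASCII

assert src.read_bytes() == b"SELECT 1;\n"
```

Using encode("utf-8") instead of encode("ascii") covers the non-ASCII case mentioned in step 4.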


Answer 3:

Here's a (bad) PowerShell script that will fix files in this state. It replaces the sequence "0x0D 0x00 0x0D 0x0A" with "0x0D 0x00 0x0A" and then overwrites the file it was given.

Afterwards you should probably re-save the file in something like UTF-8.

function Fix-Encoding
{
    Param(
        [String]$file
    )
    $f = get-item $file;
    $bytes = [System.IO.File]::ReadAllBytes($f.fullname);
    $output = new-object "System.Collections.Generic.List[System.Byte]"
    $output.Capacity = $bytes.Length

    for ($i = 0; $i -lt $bytes.Length; $i++)
    {
        # Look for the mangled sequence 0D 00 0D 0A (bounds-checked so we
        # never index past the end of the array).
        if ($i -le $bytes.Length - 4 -and
            $bytes[$i] -eq 0x0D -and $bytes[$i+1] -eq 0x00 -and
            $bytes[$i+2] -eq 0x0D -and $bytes[$i+3] -eq 0x0A)
        {
            # Emit the original 0D 00 0A and skip past the matched bytes.
            $output.Add(0x0D);
            $output.Add(0x00);
            $output.Add(0x0A);
            $i += 3
        }
        else
        {
            $output.Add($bytes[$i]);
        }
    }
    [System.IO.File]::WriteAllBytes($f.fullname, $output.ToArray())
}
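For reference, the same repair can be expressed compactly in Python (a sketch, not from the original answer): bytes.replace performs the identical left-to-right, non-overlapping substitution that the loop above implements.

```python
# The same byte-level repair: the mangled sequence 0D 00 0D 0A collapses
# back to the original UTF-16LE line ending 0D 00 0A.
def fix_bytes(data: bytes) -> bytes:
    return data.replace(b"\x0d\x00\x0d\x0a", b"\x0d\x00\x0a")

# Round-trip check: mangle a UTF-16LE CRLF, then repair it.
good = "x=1\r\n".encode("utf-16-le")      # ends ... 0d 00 0a 00
bad = good.replace(b"\x0a", b"\x0d\x0a")  # what CRLF conversion produced
assert fix_bytes(bad) == good
```

As with the PowerShell version, re-save the result as UTF-8 afterwards so the problem cannot recur.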


Answer 4:

To add to the good explanation by @bobince: one solution to this problem (except for files with special characters) is to convert everything to UTF-8. I solved this by running a Python script in Notepad++ on all files in a directory (from a computer that did not have the files messed up).

I found the original script here

A copy of the Notepad++ Python script:

import os

filePathSrc = "C:\\Temp\\UTF8"
# Binary file types that must not be re-encoded.
skipExts = ('.jar', '.ear', '.gif', '.jpg', '.jpeg', '.xls', '.png',
            '.cab', '.ico')

for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if not fn.lower().endswith(skipExts):
            notepad.open(root + "\\" + fn)
            console.write(root + "\\" + fn + "\r\n")
            notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
            notepad.save()
            notepad.close()