Can I make git recognize a UTF-16 file as text?

2019-01-02 17:08发布

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.

7条回答
皆成旧梦
2楼-- · 2019-01-02 17:21

I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of the built-in GNU diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.

查看更多
临风纵饮
3楼-- · 2019-01-02 17:27

Have you tried setting your .gitattributes to treat it as a text file?

e.g.:

*.vmc diff

More details at http://www.git-scm.com/docs/gitattributes.html.

查看更多
残风、尘缘若梦
4楼-- · 2019-01-02 17:27

Had this problem on Windows recently, and the dos2unixand unix2dos bins that ship with git for windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Observe this will only work if your file doesn't need to be UTF-16. For example, someone accidently encoded a python file as UTF-16 when it didn't need to be (in my case).

PS C:\Users\xxx> dos2unix my_file.py
dos2unix: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 Unix format...

and

PS C:\Users\xxx> unix2dos my_file.py
unix2dos: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 DOS format...
查看更多
谁念西风独自凉
5楼-- · 2019-01-02 17:28

There is a very simple solution that works out of the box on Unices.

For example, with Apple's .strings files just:

  1. Create a .gitattributes file in the root of your repository with:

    *.strings diff=localizablestrings
    
  2. Add the following to your ~/.gitconfig file:

    [diff "localizablestrings"]
    textconv = "iconv -f utf-16 -t utf-8"
    

Source: Diff .strings files in Git (and older post from 2010).

查看更多
公子世无双
6楼-- · 2019-01-02 17:30

By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).

But looking at the .gitattributes manpage, here is the custom attribute that is binary:

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16

Also note that you should still be able to diff a file, even if git thinks it's binary with:

git diff --text

Edit

This answer basically says that GNU diff wth UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

#!/bin/bash
diff <(iconv -f utf-16 -t utf-8 "$1") <(iconv -f utf-16 -t utf-8 "$2")

Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.

As for the output to the terminal when looking at a diff of a UTF-16 file:

Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).

查看更多
人气声优
7楼-- · 2019-01-02 17:32

I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).

Note that this script requires both file and iconv commands to be available on the system.

查看更多
登录 后发表回答