Grep thinks text file is binary, but it isn't

Posted 2019-03-02 09:49

Question:

I came across a .cpp file in our codebase that is seen as binary by grep. So I can't grep it like a text file, which is annoying and obviously not how things ought to be. So I want to know why grep thinks the file is binary and address the issue.

I tried to find any characters out of the ordinary using the command

grep -Pna --color -r "[\x00-\x08]|[\x10-\x19]|[\x80-\xFF]" test.cpp

but it doesn't yield any matches.

How can I figure out the cause of this problem?

I should mention that I'm using Git Bash on Windows.

Output of locale:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Answer 1:

Since you’re using MS Windows, it’s possible that test.cpp is encoded as either UTF-16 (common in recent versions of Windows) or Windows-1252 (CP-1252). In the latter case, a single non-ASCII character, such as a typographic quote in one of the comments, would be enough to cause the problem.
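
One way to test the UTF-16 guess is to look at the first two bytes of the file for a byte order mark (BOM). The following is only a sketch: the sample file is a hypothetical stand-in created with printf (substitute test.cpp), the variable names are made up for illustration, and a UTF-16 file saved without a BOM would not be detected this way.

```shell
# Hypothetical stand-in for test.cpp: UTF-16LE BOM (FF FE) followed by "int"
sample=$(mktemp)
printf '\377\376i\000n\000t\000' > "$sample"

# Read the first two bytes and print them as hex without spaces
bom=$(head -c 2 "$sample" | od -An -tx1 | tr -d ' \n')

case "$bom" in
  fffe) guess="UTF-16LE" ;;
  feff) guess="UTF-16BE" ;;
  *)    guess="no UTF-16 BOM" ;;
esac

echo "$guess"
rm -f "$sample"
```

If no BOM is reported, Windows-1252 or some other 8-bit encoding remains a candidate.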

When your locale is set to UTF-8 and grep encounters byte sequences that are not valid UTF-8, it assumes that the file is binary. A quick way around this is to have grep use the C locale, in which every byte is treated as a valid character, by setting the LC_ALL environment variable for just that one command:

LC_ALL=C grep pattern test.cpp
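
The same trick can help locate the bytes that upset grep in the first place. This is a minimal sketch, assuming GNU grep: the sample file is a hypothetical stand-in containing the Windows-1252 right single quote (byte 0x92), and the variable names are illustrative only.

```shell
# Hypothetical stand-in for test.cpp: line 1 contains byte 0x92 (octal 222)
sample=$(mktemp)
printf 'int x = 0; // it\222s fine\nint y = 1;\n' > "$sample"

# In the C locale, match any byte that is neither printable ASCII nor whitespace.
# -n prints line numbers; -a keeps grep in text mode even if NUL bytes are present.
offending=$(LC_ALL=C grep -na '[^[:print:][:space:]]' "$sample")

# Dump the matching lines byte by byte; non-printable bytes show up as octal codes
printf '%s\n' "$offending" | od -An -c

rm -f "$sample"
```

Each reported line starts with its line number, which narrows down where to look in an editor or hex viewer.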

A better long-term solution would be to convert the text files to UTF-8, using iconv or your favourite text editor.
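
A conversion with iconv might look like the sketch below. The source encoding (WINDOWS-1252) is an assumption that should be confirmed for the real file first, and the sample file, the is_utf8 helper, and the variable names are hypothetical.

```shell
# Hypothetical stand-in for test.cpp, containing byte 0x92 (a CP-1252 curly apostrophe)
src=$(mktemp)
dst=$(mktemp)
printf 'int x = 0; // it\222s fine\n' > "$src"

# Helper (illustrative): iconv exits non-zero if the input is not valid UTF-8
is_utf8() { iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; }

if is_utf8 "$src"; then before="valid UTF-8"; else before="not valid UTF-8"; fi

# Convert, assuming the file really is Windows-1252
iconv -f WINDOWS-1252 -t UTF-8 "$src" > "$dst"

if is_utf8 "$dst"; then after="valid UTF-8"; else after="not valid UTF-8"; fi

echo "before: $before"
echo "after:  $after"
rm -f "$src" "$dst"
```

Writing to a separate output file and checking it before replacing the original avoids losing data if the guessed encoding turns out to be wrong.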