How to tell binary from text files in Linux

Posted 2019-03-25 01:58

Question:

The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing different output for each.

Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.

To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

Answer 1:

The diff manual specifies that

diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
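Since the question is literally "will diff attempt a text-based comparison", one option is to ask diff itself. The following is a minimal sketch, not a definitive implementation: diff_says_binary is a hypothetical helper name, and it assumes a diff (such as GNU diff) that reports binary input with a line starting with "Binary files" when the locale is C.

```shell
# Hypothetical helper: succeeds (exit status 0) if diff refuses a
# text-based comparison for the file given as $1.
# Assumption: diff prints "Binary files A and B differ" for binary
# input, as GNU diff does when LC_ALL=C.
diff_says_binary() {
  LC_ALL=C diff /dev/null "$1" 2>&1 | grep -q '^Binary files '
}
```

Lines of ordinary diff output start with markers such as "> ", so a text file that merely contains the words "Binary files" does not trigger a false positive. An empty file produces no output at all and is therefore reported as not binary.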



Answer 2:

file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
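The rule above can be sketched as a small shell function. This is only an illustration: is_text_by_file is a hypothetical name, and the -b (brief) option is used so that the file name itself is not printed and cannot accidentally supply the word "text".

```shell
# Hypothetical helper: succeeds (exit status 0) if the description
# printed by file(1) contains the word "text".
# -b suppresses the "filename:" prefix in the output.
is_text_by_file() {
  file -b -- "$1" | grep -q 'text'
}
```

Usage would look like: is_text_by_file notes.txt && echo "text" || echo "binary". The answer is only as good as the heuristics built into file.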

If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).



Answer 3:

A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.

Update: According to the diff manual, this is exactly what diff does.
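A rough sketch of that NUL check in shell follows. The name has_no_nul and the 4096-byte default are assumptions made for illustration; the number of bytes diff actually inspects is system dependent.

```shell
# Hypothetical helper: succeeds (exit status 0) if no NUL byte occurs
# in the first $2 bytes of file $1 (default 4096, an assumed value).
# Returns 2 if $1 is not a regular file.
has_no_nul() {
  [ -f "$1" ] || return 2
  limit=${2:-4096}
  # Count the bytes in the prefix, then count them again with NULs removed.
  total=$(head -c "$limit" < "$1" | wc -c)
  kept=$(head -c "$limit" < "$1" | tr -d '\000' | wc -c)
  # Equal counts mean that no NUL byte was deleted.
  [ "$total" -eq "$kept" ]
}
```

An empty file passes this check, and UTF-16 or UTF-32 text fails it, which matches the caveat above.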



Answer 4:

You could try running

strings yourfile

and comparing the size of the output with the file size. I'm not totally sure, but if they are the same, the file is really a text file.



Answer 5:

These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
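If the encoding matters, one way to take it into account is to ask file for its encoding guess rather than its description. The sketch below is an assumption-laden illustration: is_text_any_encoding is a hypothetical name, and it relies on file's --mime-encoding option printing "binary" for content it cannot identify as text in any encoding it knows.

```shell
# Hypothetical helper: succeeds (exit status 0) if file(1) reports a
# text encoding (for example us-ascii, utf-8 or utf-16le) rather than
# "binary" for the file given as $1.
is_text_any_encoding() {
  enc=$(file -b --mime-encoding -- "$1") || return 2
  [ "$enc" != "binary" ]
}
```

Unlike a plain NUL check, this can accept UTF-16 text, at the cost of depending on file's heuristics.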

See here for how Subversion does it.



Answer 6:

This approach uses the same criteria as grep to determine whether a file is binary or text:

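# Succeeds (exit status 0) if grep treats the file as text.
is_text_file() {
  # -- keeps a file name that starts with '-' from being read as an option
  grep -qI -- '.' "$1"
}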

grep options used:

  • -q Quiet; Exit immediately with zero status if any match is found
  • -I Process a binary file as if it did not contain matching data

grep pattern used:

  • '.' match any single character. All files (except an empty file) will match this pattern.

Notes

  • An empty file is not considered a text file according to this test.
  • Symbolic links are followed.
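As a usage illustration, the function can be wrapped in a loop that labels each file. The classify wrapper is a hypothetical addition, not part of the original answer; is_text_file is repeated here so the snippet is self-contained.

```shell
# Same test as above: grep -I treats binary files as non-matching.
is_text_file() {
  grep -qI -- '.' "$1"
}

# Hypothetical wrapper: print "name: text" or "name: binary" for
# every file passed as an argument.
classify() {
  for f in "$@"; do
    if is_text_file "$f"; then
      printf '%s: text\n' "$f"
    else
      printf '%s: binary\n' "$f"
    fi
  done
}
```

For example, classify * labels everything in the current directory. As the notes say, an empty file ends up labelled binary.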


Answer 7:

A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The Type column will show you whether a file is text or binary.



Answer 8:

Commands like less and grep detect this quickly and easily. You can have a look at their source.