The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing different output for each.
Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
The diff manual specifies that
diff determines whether a file is text
or binary by checking the first few
bytes in the file; the exact number of
bytes is system dependent, but it is
typically several thousand. If every
byte in that part of the file is
non-null, diff considers the file to
be text; otherwise it considers the
file to be binary.
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
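The "text" check described above can be reduced to a yes/no answer with a small shell function. This is only a sketch: the name is_text_by_file is hypothetical, and it assumes a file implementation that supports the -b (brief, no filename prefix) and -L (follow symlinks) options.

```shell
# Hypothetical helper: succeeds (exit 0) if file's description contains "text".
# Assumes file(1) supports -b (omit the filename) and -L (follow symlinks).
is_text_by_file() {
    case "$(file -bL -- "$1")" in
        *text*) return 0 ;;   # e.g. "ASCII text", "UTF-8 Unicode text"
        *)      return 1 ;;   # e.g. "data", "empty", "JPEG image data"
    esac
}
```

It could be used as: is_text_by_file somefile && echo text || echo binary. An empty file is reported by file as "empty", so this sketch classifies it as not text.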
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
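A minimal sketch of that NUL-byte heuristic in shell follows. The function name looks_like_text and the 4096-byte window are assumptions made for illustration; the window diff really uses is system dependent, so this approximates diff's decision rather than reproducing it.

```shell
# Hypothetical helper: succeeds (exit 0) if the first 4096 bytes contain no NUL.
# The 4096-byte window is an assumption; diff's real window is system dependent.
looks_like_text() {
    [ -r "$1" ] || return 2   # unreadable or missing file: neither answer applies
    # Keep only NUL bytes from the head of the file and count them.
    nul_count=$(head -c 4096 -- "$1" | tr -dc '\000' | wc -c | tr -d ' ')
    [ "$nul_count" -eq 0 ]
}
```

Under this rule an empty file counts as text, because it contains no NUL bytes, and UTF-16 or UTF-32 files will usually count as binary.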
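You could try running strings yourfile and comparing the size of the output with the file size. I'm not totally sure, but if they are the same, the file is probably a text file. Note that by default strings only prints runs of at least four printable characters, so a text file with shorter lines can produce smaller output.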
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
This approach uses the same criteria as grep to determine whether a file is binary or text:
is_text_file() {
    grep -qI '.' "$1"
}
grep options used:
- -q: Quiet; exit immediately with zero status if any match is found.
- -I: Process a binary file as if it did not contain matching data.
grep pattern used:
- '.': Match any single character. Any file with at least one non-empty line will match this pattern.
Notes
- An empty file is not considered a text file according to this test.
- Symbolic links are followed.
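As a usage sketch, the function can be wrapped to label several files at once. The wrapper name classify_files is hypothetical, and the example assumes a grep that supports -I (GNU and BSD grep do).

```shell
# Same test as above, repeated so the example is self-contained.
is_text_file() {
    grep -qI '.' -- "$1"
}

# Hypothetical wrapper: print "text" or "binary" for each path given.
classify_files() {
    for path in "$@"; do
        if is_text_file "$path"; then
            printf '%s: text\n' "$path"
        else
            printf '%s: binary\n' "$path"
        fi
    done
}
```

For example, classify_files notes.txt photo.jpg prints one line per file. Empty or unreadable files fail the grep test, so this wrapper labels them "binary" as well.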
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The type column will show you whether a file is text or binary.
Commands like less and grep detect it quite easily (and fast). You can have a look at their source.