How do I distinguish between 'binary' and 'text' files?

Posted 2019-01-13 04:13

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).

In general, you need to know the contents of a file to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet it is still useful to talk about 'binary' and 'text' files; to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.

However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.

So the question is: how do you tell if a file is 'text' or 'binary'? To restrict it further, how do you tell on a Linux-like filesystem? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters that are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site, and I should have specified it. Pointers to existing code that does this are helpful, but I'm not really after which existing programs I can use.)

11 answers
叼着烟拽天下
#2 · 2019-01-13 04:32

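It's an old topic, but maybe someone will find this useful. If you need to decide in a script whether something is a text file, you can simply do this:

# -b drops the file name from the output, so a name containing "text" cannot match
if file -b --mime-type "$1" | grep -q '^text/'; then
    # ... handle the text file here ...
    :
fi

This gets the file type, and the silent grep (-q) lets you decide whether it is text.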

家丑人穷心不美
#3 · 2019-01-13 04:34

You can use libmagic, which is the library version of the Unix file command-line tool.

There are wrappers for many languages.

可以哭但决不认输i
#4 · 2019-01-13 04:35

You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.

file README
README: ASCII English text, with very long lines

file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
We Are One
#5 · 2019-01-13 04:38

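To list the names of text files in the current directory and its subdirectories:

$ grep -rIl ''

To list binaries:

$ grep -rIL ''

To check a particular file, modify the command slightly:

$ grep -qI '' FILE

An exit status of 0 then means the file is text; 1 means binary (an empty file also gives 1, because it has no lines to match). You can check the status with:

$ echo $?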

聊天终结者
#6 · 2019-01-13 04:39

One simple check is whether the file contains \0 (NUL) bytes. Text files normally don't have them.
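
A minimal sketch of this NUL-byte check in Python. The function name looks_binary and the 8192-byte sample size are assumptions chosen for illustration, not something prescribed by this answer. Note that the check will misclassify text stored in encodings such as UTF-16, which legitimately contain zero bytes.

```python
def looks_binary(path, sample_size=8192):
    """Return True if the first sample_size bytes of the file contain a NUL byte.

    sample_size is an arbitrary illustrative value; only the start of the
    file is inspected, so a NUL byte further in goes unnoticed.
    """
    with open(path, "rb") as f:
        chunk = f.read(sample_size)
    return b"\x00" in chunk
```

Calling looks_binary("/bin/bash") should report True on a typical Linux system, since ELF headers contain zero bytes; an empty file is reported as not binary.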

Bombasti
#7 · 2019-01-13 04:40

Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:

$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'

(Note that the underscores without a preceding dollar sign are correct: _ is the special filehandle that reuses the stat result of the previous file test. See perldoc -f -X.)
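
For readers who want to implement something similar themselves, here is a rough Python approximation of the heuristic as I understand it from the perlfunc description of -T and -B: look at the first block of the file, treat any zero byte as binary, accept valid UTF-8 containing non-ASCII characters as text, and otherwise call the file binary if more than a third of the bytes are "strange". This is not Perl's actual code. The names looks_like_text and is_odd_byte, the 512-byte block size, and the exact set of allowed control characters are all assumptions made for this sketch.

```python
import codecs

# Control bytes treated as ordinary text here (tab, LF, CR, backspace,
# form feed, escape). This exact set is an assumption for illustration.
ALLOWED_CONTROLS = frozenset(b"\t\n\r\x08\x0c\x1b")


def is_odd_byte(byte):
    """Return True for high-bit bytes and for control codes outside the allowed set."""
    if byte >= 0x80:
        return True
    return (byte < 0x20 or byte == 0x7F) and byte not in ALLOWED_CONTROLS


def looks_like_text(path, block_size=512):
    """Approximate the documented -T heuristic on the first block of the file."""
    with open(path, "rb") as f:
        block = f.read(block_size)
    if not block:
        return True  # an empty file is treated as text in this sketch
    if b"\x00" in block:
        return False  # any zero byte in the block means binary
    if any(byte >= 0x80 for byte in block):
        # Non-ASCII content: accept it as text if it is valid UTF-8.
        # final=False tolerates a multi-byte sequence cut off at the block edge.
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            decoder.decode(block, final=False)
            return True
        except UnicodeDecodeError:
            pass
    odd = sum(1 for byte in block if is_odd_byte(byte))
    # Text unless more than a third of the bytes are strange.
    return odd * 3 <= len(block)
```

Unlike the plain NUL check, this also rejects files that are mostly control codes or invalid high-bit bytes, at the cost of a threshold that is necessarily somewhat arbitrary.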
