How do I distinguish between 'binary' and 'text' files?

Posted 2019-01-13 04:13

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc.) and 'text' files (source code, XML files, HTML files, email, etc.).

In general, you need to know the contents of a file to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. Yet it is still useful to talk about 'binary' and 'text' files, so to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.

However, there are various tools that work on a wide range of files, and in practical terms you want them to do something different depending on whether the file is 'text' or 'binary'. An example is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'Binary' data messes up your terminal, and is generally not useful to look at. GNU grep, at least, uses this distinction when deciding whether it should output matches to the console.

So, the question is: how do you tell if a file is 'text' or 'binary'? To restrict it further, how do you tell on a Linux-like file system? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters that are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site, and I should have specified it.) Pointers to existing code that does this are helpful, but I'm not really after a list of existing programs I can use to do this.

11 answers
家丑人穷心不美
Reply #2 · 2019-01-13 04:42

You can determine the MIME type of the file with

file --mime FILENAME

The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).

If the result starts with text/, the file is text; otherwise it is binary. The only exceptions are XML applications, which you can match by looking for +xml at the end of the MIME type.
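The rule above can be sketched in C as a small string check. This is only an illustration: mime_is_text is a hypothetical helper, and it assumes you already have the MIME string on its own, without the leading "FILENAME:" prefix (for instance from file's -b/--brief option), in a form such as "text/plain; charset=us-ascii".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: decide 'text' vs 'binary' from a MIME string such as
 * "text/plain; charset=us-ascii". */
bool mime_is_text(const char *mime)
{
    /* Only the media type counts; ignore any parameters after ';'. */
    size_t len = strcspn(mime, ";");

    /* Trim spaces between the media type and the ';', if any. */
    while (len > 0 && mime[len - 1] == ' ')
        len--;

    /* Rule 1: the media type starts with "text/". */
    if (len >= 5 && strncmp(mime, "text/", 5) == 0)
        return true;

    /* Rule 2: XML applications, whose media type ends in "+xml". */
    if (len >= 4 && strncmp(mime + len - 4, "+xml", 4) == 0)
        return true;

    return false;
}
```

With this sketch, "image/svg+xml; charset=us-ascii" counts as text, while "application/octet-stream; charset=binary" does not.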

Lonely孤独者°
Reply #3 · 2019-01-13 04:42

Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing whether those bytes all qualify as 'text' (i.e., whether they all fall within the range of printable ASCII characters). For a finer distinction there's always the 'file' command on UNIX-like systems.
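A minimal sketch of that heuristic, assuming the caller has already read the first n bytes into a buffer. The name looks_like_ascii_text and the choice to accept tab, LF and CR alongside printable ASCII are assumptions made for illustration, not something the answer specifies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte in buf[0..n) is printable ASCII
 * (0x20..0x7E) or common whitespace (tab, LF, CR). */
bool looks_like_ascii_text(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        bool printable = (c >= 0x20 && c <= 0x7E);
        bool whitespace = (c == '\t' || c == '\n' || c == '\r');
        if (!printable && !whitespace)
            return false;   /* control character or non-ASCII byte */
    }
    return true;            /* an empty buffer counts as text here */
}
```

A caller would typically fread() the first few hundred bytes of the file and pass the buffer and the number of bytes actually read. Note that this sketch treats any UTF-8 text containing non-ASCII characters as 'binary'.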

The star\"
Reply #4 · 2019-01-13 04:44

Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). You will also want to accept whitespace such as newlines and tabs, since isprint() rejects them. It gets a little more complicated for Unicode.

To distinguish a Unicode text file, MSDN offers some great advice as to what to do.

The gist of it is to first inspect up to the first four bytes for a byte-order mark (BOM):

EF BB BF     UTF-8
FF FE        UTF-16, little-endian
FE FF        UTF-16, big-endian
FF FE 00 00  UTF-32, little-endian
00 00 FE FF  UTF-32, big-endian

That will tell you the encoding. Then you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to decode the data yourself, since a single character can be represented by a variable number of bytes. Also, if you want to be really thorough, you'll want to use the locale variant of iswprint if that's available on your platform.
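The byte patterns in the table above can be checked with a handful of memcmp() calls. The sketch below is one way to do it; the enum and the function name detect_bom are hypothetical. The one subtlety is ordering: the UTF-32 little-endian mark begins with the UTF-16 little-endian mark, so the four-byte patterns have to be tested first.

```c
#include <stddef.h>
#include <string.h>

/* Encodings that can be recognized from a byte-order mark. */
enum bom_type { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE,
                BOM_UTF32_LE, BOM_UTF32_BE };

/* Hypothetical helper: classify the BOM at the start of buf[0..n). */
enum bom_type detect_bom(const unsigned char *buf, size_t n)
{
    /* Four-byte marks first: FF FE 00 00 begins with FF FE. */
    if (n >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return BOM_UTF32_LE;
    if (n >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return BOM_UTF32_BE;

    if (n >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return BOM_UTF8;

    if (n >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return BOM_UTF16_LE;
    if (n >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return BOM_UTF16_BE;

    return BOM_NONE;   /* no BOM; the file may still be text */
}
```

A UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32 here; that ambiguity is inherent in the byte patterns. Many text files, UTF-8 ones in particular, carry no BOM at all, so BOM_NONE says nothing about whether the file is text.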

\"骚年 ilove
Reply #5 · 2019-01-13 04:48

As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.

This file, called magic, was historically stored in /etc, although it may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within a given file type, and the file command examines these locations to determine the type of the file.

The structure and description of the magic file can be found by consulting the relevant manual page (man magic).

As for an implementation, that can be found within file.c itself. The portion of the file command that decides whether the data is readable text is the following loop, which rejects any non-ASCII byte and any control character other than whitespace, backspace, Ctrl-Z ('\032') and ESC ('\033'):
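To make the idea of "known values at known offsets" concrete, here is a small sketch of a magic-number lookup. It is not how the file command is implemented and it does not read the magic database; the struct, the table and the function match_magic are hypothetical, and the table hard-codes only a few well-known signatures.

```c
#include <stddef.h>
#include <string.h>

/* One signature: `len` bytes expected at `offset` from the start of the file. */
struct magic_entry {
    size_t offset;
    const char *bytes;
    size_t len;
    const char *name;
};

/* A few well-known signatures; the real magic database is far larger. */
static const struct magic_entry magic_table[] = {
    {   0, "\x7f" "ELF",        4, "ELF executable" },
    {   0, "\x89PNG\r\n\x1a\n", 8, "PNG image"      },
    {   0, "%PDF-",             5, "PDF document"   },
    {   0, "GIF8",              4, "GIF image"      },
    { 257, "ustar",             5, "tar archive"    },
};

/* Hypothetical helper: return the name of the first matching signature,
 * or NULL if none of the table entries match buf[0..n). */
const char *match_magic(const unsigned char *buf, size_t n)
{
    size_t count = sizeof magic_table / sizeof magic_table[0];

    for (size_t i = 0; i < count; i++) {
        const struct magic_entry *m = &magic_table[i];
        /* Skip entries that would read past the end of the buffer. */
        if (n >= m->offset + m->len &&
            memcmp(buf + m->offset, m->bytes, m->len) == 0)
            return m->name;
    }
    return NULL;
}
```

The tar entry shows why the offset matters: its signature sits 257 bytes into the file rather than at the start, so the caller needs to read at least 262 bytes for that entry to be considered.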

    /* Make sure we are dealing with ascii text before looking for tokens */
    for (i = 0; i < nbytes - 1; i++) {
        if (!isascii(buf[i]) ||
            (iscntrl(buf[i]) && !isspace(buf[i]) &&
             buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'))
            return 0;   /* not all ASCII */
    }
可以哭但决不认输i
Reply #6 · 2019-01-13 04:52

The spreadsheet software my company makes reads a number of binary file formats as well as text files.

We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
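The answer does not say how the "appears to be UTF-8" test is done. One plausible way, shown here purely as a sketch, is to check that the sampled bytes form well-formed UTF-8. The function name is_valid_utf8 is hypothetical, and treating a sequence cut off at the end of the buffer as invalid is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..n) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t n)
{
    /* Smallest code point allowed for 1-, 2-, 3- and 4-byte sequences. */
    static const unsigned long min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    while (i < n) {
        unsigned char c = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point decoded so far */

        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;  /* stray continuation or invalid lead byte */

        if (extra > n - i - 1)
            return false;   /* sequence runs past the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates, and values above U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

When only the first 2K bytes are sampled, the cut can land in the middle of a multi-byte character, so a real implementation would need to tolerate an incomplete final sequence. Plain ASCII, including NUL and other control bytes, also passes this check, so it is best combined with a control-character test.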
