Get file encoding [duplicate]

Posted 2019-02-15 11:56

Possible Duplicate:
Detect file encoding in PHP

How can I figure out with PHP what file encoding a file has?

5 answers
Evening l夕情丶
#2 · 2019-02-15 12:28

Detecting the encoding is really hard for all 8-bit character sets except UTF-8 (because not every 8-bit byte sequence is valid UTF-8), and it usually requires semantic knowledge of the text whose encoding is to be detected.

Think of it this way: any piece of plain text is just a bunch of bytes with no encoding information attached. Any single byte could mean anything, so to have a chance at detecting the encoding you have to look at that byte in the context of the other bytes and try some heuristics based on the languages the text might plausibly be written in.

For 8-bit character sets you can never be sure, though.
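
To make the ambiguity concrete, here is a minimal sketch (assuming the mbstring extension is loaded) that decodes the same single byte under two different 8-bit character sets. The byte value is just an illustrative pick; nothing in the byte itself says which reading is the intended one.

```php
<?php
// One byte, two plausible readings.
$byte = "\xA4";

// Decode the same byte under two different 8-bit character sets.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // currency sign
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15'); // euro sign

echo bin2hex($asLatin1), "\n"; // c2a4   (U+00A4)
echo bin2hex($asLatin9), "\n"; // e282ac (U+20AC)
```

Both conversions succeed without any error, which is exactly why a detector has nothing to go on except context.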

A demonstration of heuristics going wrong is here for example:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

With some 16-bit encodings you do have a chance at detection, because they might include a byte order mark or have every second byte set to 0.
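
The "every second byte is 0" idea can be sketched roughly as follows. The function name and the 90% threshold are made up for illustration, and the heuristic only has a chance on text that is mostly ASCII; treat the result as a guess, not a verdict.

```php
<?php
// Hypothetical helper: guesses the UTF-16 byte order of mostly-ASCII text
// that has no byte order mark. Returns null when no clear pattern shows up.
function guessUtf16ByZeroBytes(string $bytes): ?string
{
    $len = strlen($bytes) - (strlen($bytes) % 2); // ignore a trailing odd byte
    if ($len < 2) {
        return null;
    }
    $evenZeros = 0; // zero bytes at offsets 0, 2, 4, ...
    $oddZeros  = 0; // zero bytes at offsets 1, 3, 5, ...
    for ($i = 0; $i < $len; $i += 2) {
        if ($bytes[$i] === "\0") {
            $evenZeros++;
        }
        if ($bytes[$i + 1] === "\0") {
            $oddZeros++;
        }
    }
    $pairs = $len / 2;
    // The 90% threshold is an arbitrary assumption, not a standard value.
    if ($oddZeros >= 0.9 * $pairs && $evenZeros === 0) {
        return 'UTF-16LE'; // low byte first, so the zero high byte comes second
    }
    if ($evenZeros >= 0.9 * $pairs && $oddZeros === 0) {
        return 'UTF-16BE'; // the zero high byte comes first
    }
    return null;
}
```

For example, `guessUtf16ByZeroBytes("h\0i\0")` yields `'UTF-16LE'`, while plain ASCII input yields `null`.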

If you just want to detect UTF-8, you can either use mb_detect_encoding as already explained, or you can use this handy little function:

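function isUTF8($string) {
    // True only if the whole string is well-formed UTF-8. Plain ASCII counts as
    // valid; without the \A ... \z anchors a single valid sequence would be enough.
    return preg_match('%\A(?:
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z%xs', $string) === 1;
}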
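
If the mbstring extension is available, a built-in validity check may be simpler than a hand-written pattern. A minimal sketch using mb_check_encoding (the sample strings are made up for illustration):

```php
<?php
// Assumes the mbstring extension is loaded.
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1; a lone 0xE9 is not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```

Note that this only tells you whether the bytes are well-formed UTF-8, not whether UTF-8 is what the author actually intended.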
虎瘦雄心在
#3 · 2019-02-15 12:31

You can't really, unless the file is kind enough to tell you somewhere inside it.

For example, HTML files are meant to contain a content-type meta tag near the top, so that your web browser knows what encoding is used, e.g.

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

or

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

There are methods that try to guess by looking at the file and spotting byte sequences that suggest certain encodings, but these are really only guessing.
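
When the file does declare its encoding in a meta tag like the ones above, pulling it out could look roughly like this. The function name and the 1024-byte window are assumptions for illustration, and the approach only works if the declaration itself is written in ASCII-compatible bytes.

```php
<?php
// Hypothetical helper: returns the charset declared in an HTML meta tag, or null.
function declaredHtmlCharset(string $html): ?string
{
    // Only inspect the beginning of the document, where the declaration should be.
    $head = substr($html, 0, 1024);
    // Matches charset=... inside a content attribute as well as <meta charset="...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared; you are back to guessing
}
```

Keep in mind that a declared charset is a claim by the author, and files are sometimes saved in a different encoding than the one they announce.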

时光不老,我们不散
#4 · 2019-02-15 12:33

BlackAura's suggestion is very good, IMHO.

Another option is to call file(1) on the file in question using system() or the like. Often, it is able to tell you the encoding as well. It should be available in any sane UNIX environment.
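
A rough sketch of calling file(1) from PHP, using shell_exec() rather than system() so the output can be captured. It assumes a Unix-like system with a reasonably recent `file` on the PATH that supports the `--mime-encoding` option; the wrapper name is made up.

```php
<?php
// Hypothetical wrapper around file(1).
function encodingFromFileCommand(string $path): ?string
{
    // -b omits the file name; --mime-encoding prints only the charset guess.
    $cmd = 'file -b --mime-encoding ' . escapeshellarg($path) . ' 2>/dev/null';
    $out = shell_exec($cmd);
    if (!is_string($out) || trim($out) === '') {
        return null; // command unavailable or it produced no output
    }
    return trim($out); // e.g. "utf-8" or "us-ascii"; still a heuristic guess
}
```

escapeshellarg() matters here, since the path ends up on a shell command line.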

三岁会撩人
#5 · 2019-02-15 12:35

You can use the fread() function to look at the first few bytes of the file for the "magic number", and then map that magic number against a list of known magic numbers for file types.
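
For text encodings, the closest thing to a magic number is the byte order mark. A minimal sketch of the fread() approach applied to BOMs (the function name is hypothetical, and a missing BOM tells you nothing either way):

```php
<?php
// Hypothetical helper: reads the first bytes of a file and compares them
// against a list of known byte order marks.
function detectBom(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $head = fread($handle, 4); // the longest BOM checked here is 4 bytes
    fclose($handle);
    if ($head === false) {
        return null;
    }
    // Longer signatures first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM found: the encoding is still unknown
}
```

Most UTF-8 files and practically all legacy 8-bit files carry no BOM, so this covers only the easy cases.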

贪生不怕死
#6 · 2019-02-15 12:40

mb_detect_encoding should be able to do the job.

http://us.php.net/manual/en/function.mb-detect-encoding.php

In its default setup, it'll only detect ASCII, UTF-8, and a few Japanese JIS variants. It can be configured to detect more encodings, if you specify them manually. If a file is both ASCII and UTF-8, it'll return UTF-8.
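
Specifying the candidate encodings manually might look like this. The candidate list and sample bytes are illustrative assumptions, and the sketch assumes the mbstring extension is loaded. UTF-8 is listed before ISO-8859-1 because every byte sequence is valid ISO-8859-1, so that one effectively acts as a fallback.

```php
<?php
// Candidate encodings to consider (an illustrative choice, not a recommendation).
$candidates = ['UTF-8', 'ISO-8859-1'];

$latin1Bytes = "caf\xE9"; // "café" in ISO-8859-1; not valid UTF-8

// The third argument enables strict mode, which rules out candidates for
// which the input contains invalid byte sequences.
$guess = mb_detect_encoding($latin1Bytes, $candidates, true);
var_dump($guess); // string(10) "ISO-8859-1"
```

The result is still the best match among the candidates you supplied, so a file in an encoding that is not on the list will be misreported rather than rejected.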
