Ensuring valid utf-8 in PHP

2019-01-10 21:37发布

问题:

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}

回答1:

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would chose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);


回答2:

With mbstring library, you have mb_check_encoding().

Example of use:

mb_check_encoding($string, 'UTF-8');

When performance matters, this is faster than the regex provided in the accepted answer.

A quick test on my configuration shows (for 20,000 iterations):

  • regex: ~310ms
  • mb_check_encoding: ~90ms

EDIT

With PHP 7.1.9 on a recent Windows 10 system, the regex solution outperforms mb_check_encoding() for any string length (still 20,000 iterations):

  • 10 chars: regex => 4ms, mb_check_encoding() => 64ms
  • 10000 chars: regex => 125ms, mb_check_encoding() => 2.4s


回答3:

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }


回答4:

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about charsets. This page links to a page specifically for utf8.



回答5:

answer to "iconv is idempotent"

neither is iconv - iconv is not idempotent

a big difference between utf8_encode() & iconv() is that iconv may raise errors like this "Detected an incomplete multibyte character in input string" even with

iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)

in the above code:

$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

you have to know mb_detect_encoding can answer uft-8 even for invalid utf-8 strings (badly formed utf8)



回答6:

Not sure if this would achieve the same thing, but couldn't you just use utf8_encode() on all text without worrying about detection? If the text is already UTF-8, it won't hurt it. And if it's not, it will be converted. If you've already thought about doing this, is there a reason this wouldn't work for you?