How to get \uXXXX to display correctly, using PHP

Posted 2019-01-20 13:39

Question:

I have inherited a database which contains strings such as:

\u5353\u8d8a\u4e9a\u9a6c\u900a: \u7f51\u4e0a\u8d2d\u7269: \u5728\u7ebf\u9500\u552e\u56fe\u4e66\uff0cDVD\uff0cCD\uff0c\u6570\u7801\uff0c\u73a9\u5177\uff0c\u5bb6\u5c45\uff0c\u5316\u5986

The question is, how do I get this to be displayed properly in an HTML page?

I'm using PHP5 to process the strings.

Answer 1:

1) I downloaded and installed a Unicode font named CODE2000.

2) I wrote this:

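<?php header('Content-Type: text/html;charset=utf-8'); ?>
<html>
<head></head>
<body style="font-family: CODE2000">
<?php
// I had to remove pieces like ': ', 'DVD', 'CD' so that the string consists only of \uXXXX escapes
$s = '\u5353\u8d8a\u4e9a\u9a6c\u900a\u7f51\u4e0a\u8d2d\u7269\u5728\u7ebf\u9500\u552e\u56fe\u4e66\uff0c\uff0c\uff0c\u6570\u7801\uff0c\u73a9\u5177\uff0c\u5bb6\u5c45\uff0c\u5316\u5986';
// Split on the literal "\u" prefix; each remaining piece is four hex digits
$chars = explode('\\u', $s);
foreach ($chars as $char) {
  // The string starts with "\u", so the first piece is empty; skip it
  if ($char === '') continue;
  // Name the byte order explicitly: plain 'utf-16' without a BOM is decoded
  // differently by different iconv implementations
  $c = iconv('UTF-16BE', 'UTF-8', hex2str($char));
  print $c;
}

// Turn a string of hex digits into the raw bytes it encodes
function hex2str($hex) {
  $r = '';
  for ($i = 0; $i < strlen($hex) - 1; $i += 2) {
    $r .= chr(hexdec($hex[$i] . $hex[$i + 1]));
  }
  return $r;
}
?>
</body>
</html>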

3) It produced these characters: http://img267.imageshack.us/img267/9759/49139858.png, which could be correct. For example, the 1st character (5353) is indeed 卓 and the 2nd one (8d8a) is 越. Of course I cannot be 100% sure, but it seems to fit. Maybe you can take it from here.

That was a good exercise :)



Answer 2:

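PHP strings are plain byte sequences and the core string functions are woefully unaware of Unicode (the native Unicode support planned for PHP 6 was never released), so you have to do everything yourself: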

  • Make sure that your database is using a Unicode-capable encoding for its connections. In MySQL, for example, the directive is default-character-set. UTF-8 is a reasonable choice.
  • Let the browser know which encoding you are using. There are several ways of doing this:

    1. Set a charset value in the Content-Type header. Something like header('Content-Type: text/html;charset=utf-8');

    2. Use a <meta http-equiv> version of the above header.

    3. Set the encoding in the XML declaration: <?xml version="1.0" encoding="utf-8"?>

Option 1 takes precedence over option 2. I'm not sure where option 3 fits in.
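Options 1 and 2 can be used together in one page. The following is only a minimal sketch, assuming the text has already been decoded to UTF-8; render_utf8_page is a hypothetical helper name, and the sample bytes are just the first two characters of the string from the question.

```php
<?php
// Hypothetical helper (not from the answers): wraps UTF-8 text in a minimal page.
function render_utf8_page($text) {
    // Escape the text for HTML, telling htmlspecialchars that the input is UTF-8.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Option 2: the <meta http-equiv> counterpart of the HTTP header.
    return "<!DOCTYPE html>\n<html>\n<head>\n"
         . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n"
         . "</head>\n<body>" . $safe . "</body>\n</html>\n";
}

// Option 1: the HTTP header; it has to be sent before any output.
header('Content-Type: text/html; charset=utf-8');

// "\xE5\x8D\x93\xE8\xB6\x8A" are the UTF-8 bytes of U+5353 U+8D8A.
echo render_utf8_page("\xE5\x8D\x93\xE8\xB6\x8A");
```

Sending the header and also emitting the meta tag is redundant when the page is served over HTTP, but the meta tag still helps if the page is saved and opened from disk.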

If you need to do any string processing prior to displaying the data, make sure you use the multibyte (mb_*) string functions. If you have Unicode data coming from other sources in other encodings, you'll need to use mb_convert_encoding.
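To illustrate the difference, here is a small sketch, assuming the mbstring extension is enabled. The sample strings are illustrative only and are written as byte escapes so that the example does not depend on the encoding of the source file.

```php
<?php
// UTF-8 bytes of U+5353 U+8D8A, the first two characters from the question.
$s = "\xE5\x8D\x93\xE8\xB6\x8A";

$bytes = strlen($s);               // counts bytes: 6
$chars = mb_strlen($s, 'UTF-8');   // counts characters: 2

// substr() could cut a character in half; mb_substr() works on whole characters.
$first = mb_substr($s, 0, 1, 'UTF-8');   // the 3 bytes of U+5353

// Data arriving in another encoding: "caf\xE9" is "café" in ISO-8859-1.
$latin1 = "caf\xE9";
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $bytes, " bytes, ", $chars, " characters\n";
echo $first, "\n";
echo $utf8, "\n";
```

Passing the encoding explicitly to each mb_* call avoids depending on the mbstring internal encoding setting.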



Answer 3:

Based on daremon's submission, here is a "unicode_decode" function which will convert \uXXXX sequences into their UTF-8 counterparts.

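function unicode_decode($str) {
    // The /e modifier is deprecated since PHP 5.5 and removed in PHP 7,
    // so use preg_replace_callback with a closure (PHP 5.3+) instead.
    return preg_replace_callback('/\\\\u([0-9A-F]{4})/i', function ($m) {
        // $m[1] holds the four hex digits: one big-endian UTF-16 code unit
        return iconv('UTF-16BE', 'UTF-8', hex2str($m[1]));
    }, $str);
}

// Turn a string of hex digits into the raw bytes it encodes
function hex2str($hex) {
    $r = '';
    for ($i = 0; $i < strlen($hex) - 1; $i += 2) {
        $r .= chr(hexdec($hex[$i] . $hex[$i + 1]));
    }
    return $r;
}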
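
One limitation of converting each \uXXXX on its own: a character outside the Basic Multilingual Plane is written as two escapes (a surrogate pair), and neither half converts by itself. A possible workaround, sketched here under the assumption that the iconv extension is available, is to convert each run of consecutive escapes in a single call. The name unicode_decode_runs is hypothetical and not part of the answers above.

```php
<?php
// Hypothetical variant: decodes runs of \uXXXX escapes, leaving other text untouched.
function unicode_decode_runs($str) {
    // Match one or more consecutive \uXXXX escapes so surrogate pairs stay together.
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', function ($m) {
        // Strip the "\u" prefixes, leaving a plain run of hex digits.
        $hex = str_replace('\\u', '', $m[0]);
        // pack('H*') turns the hex digits into raw bytes: big-endian UTF-16 code units.
        return iconv('UTF-16BE', 'UTF-8', pack('H*', $hex));
    }, $str);
}

// Mixed input: the ASCII parts (': ') do not need to be removed first.
$s = '\u5353\u8d8a\u4e9a\u9a6c\u900a: \u7f51\u4e0a\u8d2d\u7269';
echo unicode_decode_runs($s), "\n";   // prints: 卓越亚马逊: 网上购物
```

Because only the escapes are matched, plain text such as 'DVD' or 'CD' in the original database strings passes through unchanged.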