PHP's json_encode and character representation

Posted 2019-09-06 19:54

Question:

I'll try to present this as simply as I can: I use json_encode() to encode a number of UTF-8 strings in different languages, and I notice that characters remain unchanged when they belong to the ASCII table, but everything else is returned as '\unnnn', where 'nnnn' is a hexadecimal number.

See the code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />
 <title>Multibyte string functions</title>
</head>
<body>
<h3>Multibyte string functions</h3>
<p>
<?php
//present json encode errors nicely:
//assign integer values to keys and error names to values
echo '<br /><b>Define JSON errors</b><br />';
$constants = get_defined_constants(true);
$json_errors = array();
foreach ($constants["json"] as $name => $value) {
   if (!strncmp($name, "JSON_ERROR_", 11)) {
      $json_errors[$value] = $name;
   }
}
echo nl2br(print_r($json_errors, true), true);

//Display current detection order
echo "<br /><b>Current detection order 'mb_detect_order()':</b> ", implode(", ", mb_detect_order());
//Display internal encoding
echo "<br /><b>Internal encoding 'mb_internal_encoding()':</b> ",  mb_internal_encoding();
//Get current language
echo "<br /><b>Current detection language 'mb_language()' ('neutral' for utf8):</b> ", mb_language();

//our test data
//a nowdoc that can break an <input> field
$str = <<<'STR'
O'Reilly(\n) "& 'Big\Two @ <span>bo\tld</span>"
STR;
$strings = array(
   $str,
   "Latin: tell me the answer and I might find the question!",
   "Greek: πες μου την ερώτηση και ίσως βρω την απάντηση!",
   "Chinese simplified: 告诉我答复，并且我也许发现问题!",
   "Arabic: أخبرني الاجابة, انا قد تجد مسالة!",
   "Portuguese: mais coisas a pensar sobre diário ou dois!",
   "French: plus de choses à penser à journalier ou à deux!",
   "Spanish: ¡más cosas a pensar en diario o dos!",
   "Italian: più cose da pensare circa giornaliere o due!",
   "Danish: flere ting å tenke på hver dag eller to!",
   "Chech: Další věcí, přemýšlet o každý den nebo dva!",
   "German: mehr über Spaß spät schönen",
   "Albanian: më vonë gjatë fun bukur",
   "Hungarian: több mint szórakozás késő csodálatos kenyér"
);

//show encoding and then encode
foreach( $strings as $string ){
   echo "<br /><br />$string :", mb_detect_encoding($string);
   $json = json_encode($string);
   echo "<br />Error? ", $json_errors[json_last_error()];
   echo '<br />json=', $json;
}
?>
</p>
</body>
</html>

The above code will output:

Define JSON errors
Array
(
[0] => JSON_ERROR_NONE
[1] => JSON_ERROR_DEPTH
[2] => JSON_ERROR_STATE_MISMATCH
[3] => JSON_ERROR_CTRL_CHAR
[4] => JSON_ERROR_SYNTAX
[5] => JSON_ERROR_UTF8
)

Current detection order 'mb_detect_order()': ASCII, UTF-8
Internal encoding 'mb_internal_encoding()': ISO-8859-1
Current detection language 'mb_language()' ('neutral' for utf8): neutral

O'Reilly(\n) "& 'Big\Two @ bo\tld" :ASCII
Error? JSON_ERROR_NONE
json="O'Reilly(\\n) \"& 'Big\\Two @ bo\\tld<\/span>\""

Latin: tell me the answer and I might find the question! :ASCII
Error? JSON_ERROR_NONE
json="Latin: tell me the answer and I might find the question!"

Greek: πες μου την ερώτηση και ίσως βρω την απάντηση! :UTF-8
Error? JSON_ERROR_NONE
json="Greek: \u03c0\u03b5\u03c2 \u03bc\u03bf\u03c5 \u03c4\u03b7\u03bd \u03b5\u03c1\u03ce\u03c4\u03b7\u03c3\u03b7 \u03ba\u03b1\u03b9 \u03af\u03c3\u03c9\u03c2 \u03b2\u03c1\u03c9 \u03c4\u03b7\u03bd \u03b1\u03c0\u03ac\u03bd\u03c4\u03b7\u03c3\u03b7!"

Chinese simplified: 告诉我答复，并且我也许发现问题! :UTF-8
Error? JSON_ERROR_NONE
json="Chinese simplified: \u544a\u8bc9\u6211\u7b54\u590d\uff0c\u5e76\u4e14\u6211\u4e5f\u8bb8\u53d1\u73b0\u95ee\u9898!"

Arabic: أخبرني الاجابة, انا قد تجد مسالة! :UTF-8
Error? JSON_ERROR_NONE
json="Arabic: \u0623\u062e\u0628\u0631\u0646\u064a \u0627\u0644\u0627\u062c\u0627\u0628\u0629, \u0627\u0646\u0627 \u0642\u062f \u062a\u062c\u062f \u0645\u0633\u0627\u0644\u0629!"

Portuguese: mais coisas a pensar sobre diário ou dois! :UTF-8
Error? JSON_ERROR_NONE
json="Portuguese: mais coisas a pensar sobre di\u00e1rio ou dois!"

French: plus de choses à penser à journalier ou à deux! :UTF-8
Error? JSON_ERROR_NONE
json="French: plus de choses \u00e0 penser \u00e0 journalier ou \u00e0 deux!"

Spanish: ¡más cosas a pensar en diario o dos! :UTF-8
Error? JSON_ERROR_NONE
json="Spanish: \u00a1m\u00e1s cosas a pensar en diario o dos!"

Italian: più cose da pensare circa giornaliere o due! :UTF-8
Error? JSON_ERROR_NONE
json="Italian: pi\u00f9 cose da pensare circa giornaliere o due!"

Danish: flere ting å tenke på hver dag eller to! :UTF-8
Error? JSON_ERROR_NONE
json="Danish: flere ting \u00e5 tenke p\u00e5 hver dag eller to!"

Chech: Další věcí, přemýšlet o každý den nebo dva! :UTF-8
Error? JSON_ERROR_NONE
json="Chech: Dal\u0161\u00ed v\u011bc\u00ed, p\u0159em\u00fd\u0161let o ka\u017ed\u00fd den nebo dva!"

German: mehr über Spaß spät schönen :UTF-8
Error? JSON_ERROR_NONE
json="German: mehr \u00fcber Spa\u00df sp\u00e4t sch\u00f6nen"

Albanian: më vonë gjatë fun bukur :UTF-8
Error? JSON_ERROR_NONE
json="Albanian: m\u00eb von\u00eb gjat\u00eb fun bukur"

Hungarian: több mint szórakozás késő csodálatos kenyér :UTF-8
Error? JSON_ERROR_NONE
json="Hungarian: t\u00f6bb mint sz\u00f3rakoz\u00e1s k\u00e9s\u0151 csod\u00e1latos keny\u00e9r"

As you can see, for every language except English the non-ASCII UTF-8 characters are converted to hexadecimal escapes. Is it possible to encode without replacing my Unicode characters? Is it safe? What do other people do?

Keep in mind that these strings come from user input on web pages and are stored in MySQL.

Thanks.

Answer 1:

Maybe you should try json_encode($string, JSON_UNESCAPED_UNICODE), or any of the other options described at http://php.net/manual/fr/function.json-encode.php that may be useful for your various cases.
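As a rough sketch of how this could be wrapped: JSON_UNESCAPED_UNICODE exists since PHP 5.4.0, so the helper below uses it when it is defined and otherwise falls back to turning the \uXXXX escapes back into UTF-8. The function names json_encode_unicode and json_unescape_unicode are made up for this example, and the fallback is only one possible approach; it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of PHP): undo \uXXXX escapes in a JSON string.
function json_unescape_unicode($json)
{
    // First alternative: an escaped backslash (\\), consumed so that a literal
    // "\u00e9" in the data is left alone. Second alternative: a run of \uXXXX
    // escapes; the lookahead skips U+0000..U+007F so control characters stay
    // escaped and the result remains valid JSON.
    $pattern = '/\\\\\\\\|((?:\\\\u(?!00[0-7])[0-9a-fA-F]{4})+)/';
    return preg_replace_callback($pattern, function ($m) {
        if (!isset($m[1])) {
            return $m[0]; // escaped backslash: keep as is
        }
        // Read the run as UTF-16BE code units, which also covers
        // surrogate pairs such as \ud83d\ude00.
        $utf16 = pack('H*', str_replace('\\u', '', $m[1]));
        return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
    }, $json);
}

// Hypothetical wrapper: use the built-in flag when the PHP version has it.
function json_encode_unicode($value)
{
    if (defined('JSON_UNESCAPED_UNICODE')) {
        return json_encode($value, JSON_UNESCAPED_UNICODE);
    }
    $json = json_encode($value);
    return $json === false ? false : json_unescape_unicode($json);
}

echo json_encode_unicode("Greek: πες μου"), "\n"; // prints "Greek: πες μου"
```

Either form is valid JSON: json_decode() returns the same string for "di\u00e1rio" and for "diário", so the escaped output is safe to keep if readability is not a concern.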



Answer 2:

OK, thanks a lot for the answer!

The problem is that I'm on PHP 5.3.10, where json_encode($string, JSON_UNESCAPED_UNICODE) isn't an option.

Fortunately, a user called "Mr Swordsteel" posted a comment in the PHP manual at http://www.php.net/manual/en/function.json-encode.php which actually does the trick (thank you, Mr Swordsteel!). The nice surprise is that it emulates the json_encode function completely, which also gives a hint on how to port it to another language such as JavaScript and keep our libraries able to talk to each other.

function my_json_encode($in){
   // Escape quotes, backslashes, slashes and common control characters;
   // multibyte UTF-8 passes through untouched
   $_escape = function ($str) {
      $str = addcslashes($str, "\t\n\r\f\"\\/");
      // JSON has no \v escape, so write the vertical tab as \u000b
      return str_replace("\v", '\u000b', $str);
   };
   $out = "";
   if (is_object($in)){
      // Public properties become a JSON object; values are encoded recursively
      $arr = array();
      foreach (get_object_vars($in) as $key => $val){
         $arr[] = "\"{$_escape($key)}\":" . my_json_encode($val);
      }
      $val = implode(',', $arr);
      $out .= "{{$val}}";
   }elseif (is_array($in)){
      // Any non-numeric key turns the array into a JSON object
      $obj = false;
      $arr = array();
      foreach($in as $key => $val){
         if(!is_numeric($key)){
            $obj = true;
         }
         $arr[$key] = my_json_encode($val);
      }
      if($obj){
         foreach($arr as $key => $val){
            $arr[$key] = "\"{$_escape($key)}\":{$val}";
         }
         $val = implode(',', $arr);
         $out .= "{{$val}}";
      }else {
         $val = implode(',', $arr);
         $out .= "[{$val}]";
      }
   }elseif (is_bool($in)){
      $out .= $in ? 'true' : 'false';
   }elseif (is_null($in)){
      $out .= 'null';
   }elseif (is_string($in)){
      $out .= "\"{$_escape($in)}\"";
   }else{
      // Integers and floats are written as-is
      $out .= $in;
   }
   return $out;
}

I ran a lot of tests against it and couldn't break it! It would be very interesting now to re-implement json_decode!
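For anyone who wants to repeat that kind of testing, here is a minimal sketch of a round-trip check. The helper name round_trip_failures and the sample values are invented for this example; the idea is simply that whatever an encoder produces should come back unchanged through PHP's own json_decode().

```php
<?php
// Hypothetical test helper: returns the samples that do NOT survive
// encoder -> json_decode() unchanged.
function round_trip_failures($encoder, array $samples)
{
    $failures = array();
    foreach ($samples as $sample) {
        $json = call_user_func($encoder, $sample);
        // Decode JSON objects as associative arrays so they compare by value
        $decoded = json_decode($json, true);
        if (json_last_error() !== JSON_ERROR_NONE || $decoded !== $sample) {
            $failures[] = $sample;
        }
    }
    return $failures;
}

$samples = array(
    "Greek: πες μου την ερώτηση",
    "quote \" backslash \\ slash / tab \t newline \n",
    array("key" => "diário", "list" => array(1, 2.5, true, null)),
);

// Swap 'json_encode' for 'my_json_encode' to exercise the custom encoder.
var_dump(round_trip_failures('json_encode', $samples));
```

An empty array means every sample survived the round trip. The built-in json_encode is used here only so the snippet runs on its own.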

Thanks.



Tags: php utf-8 json