Reference: Why are my “special” Unicode characters

Posted 2018-12-31 16:52

Question:

When using "special" Unicode characters, they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '馬']);
{"foo":"\u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

Answer 1:

First of all: There's nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in ECMAScript (section 7.8.4 "String Literals") and is described as such:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

In short: Any character can be encoded as \u...., where .... is the four-digit hexadecimal Unicode code point of the character. For characters outside the BMP, two such escapes are used, one for each half of the UTF-16 surrogate pair.
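To illustrate the surrogate-pair case, here is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" string syntax) and the default json_encode flags. The code point U+1F600 is an arbitrary choice for illustration, and the variable names are made up; the arithmetic is the standard UTF-16 surrogate calculation.

```php
<?php
// U+1F600 lies outside the BMP, so JSON needs two \u escapes for it.
$codePoint = 0x1F600;

// Standard UTF-16 surrogate arithmetic.
$offset = $codePoint - 0x10000;
$high   = 0xD800 + ($offset >> 10);    // high (lead) surrogate
$low    = 0xDC00 + ($offset & 0x3FF);  // low (trail) surrogate

// Single-quoted format string: PHP keeps the backslashes as-is.
$expected = sprintf('"\u%04x\u%04x"', $high, $low);

// "\u{1F600}" is PHP 7+ syntax for the UTF-8 encoding of that code point.
$json = json_encode("\u{1F600}");

echo $json, PHP_EOL;            // "\ud83d\ude00"
var_dump($json === $expected);  // bool(true)
```

Decoding that JSON string with json_decode gives back the single original character, not two separate ones.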

"馬"
"\u99ac"

These two string literals represent the exact same character; they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.
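You can check this equivalence yourself. The following is a small sketch, assuming the script file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// This file is assumed to be saved as UTF-8.
$literal = '"馬"';      // JSON string containing the character itself
$escaped = '"\u99ac"';  // single quotes: PHP leaves the \u escape for the JSON parser

$a = json_decode($literal);
$b = json_decode($escaped);

var_dump($a === $b);    // bool(true)
var_dump(bin2hex($b));  // string(6) "e9a6ac", the UTF-8 bytes of U+99AC
```

Both inputs decode to the same three-byte UTF-8 string, so code consuming the decoded value cannot tell which form the JSON used.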

By default, PHP's json_encode encodes non-ASCII characters using \u.... escape sequences. Technically it doesn't have to, but it does, and the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference; it is not necessary in any way for transporting "Unicode characters" in JSON.
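A quick round trip shows why the flag is only cosmetic. This is a sketch, assuming a UTF-8 source file and PHP 5.4 or higher; the variable names are illustrative.

```php
<?php
$data = ['foo' => '馬'];

$escaped   = json_encode($data);                          // {"foo":"\u99ac"}
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE);  // {"foo":"馬"}

// The two JSON texts differ byte for byte...
var_dump($escaped === $unescaped);                  // bool(false)

// ...but both decode back to the original array.
var_dump(json_decode($escaped, true) === $data);    // bool(true)
var_dump(json_decode($unescaped, true) === $data);  // bool(true)
```

The escaped form has the side benefit of being pure ASCII, while the unescaped form is shorter and easier to read; which one you pick does not change the data.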