Why do code points between U+D800 and U+DBFF generate a one-element String?

Posted 2019-09-12 05:49

Question:

I'm getting really confused. Why do code points from U+D800 to U+DBFF encode as a single (2-byte) String element when using the ECMAScript 6 native Unicode helpers?

I'm not asking how JavaScript/ECMAScript encodes Strings natively; I'm asking about the extra functionality for encoding UTF-16 that makes use of UCS-2.

var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);

console.log(
  str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
); // logs: 1 55296 NaN

console.log(
  str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
); // logs: 1 55296 NaN

Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 produce a string of length 2, since my browser's ES6 implementation uses UCS-2 encoding for strings, which uses 2 bytes for each character code?

Both of these approaches return a one-element String for the U+D800 code point (char code: 55296, same as 0xD800). But for code points bigger than U+FFFF each one returns a two-element String: a lead and a trail. The lead would be a number between U+D800 and U+DBFF; the trail I'm not sure about, I only know it helps determine the resulting code point. To me the return value doesn't make sense: it represents a lead without a trail. Am I misunderstanding something?
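For example, this is what I see for a code point above U+FFFF (U+1F600 is just an arbitrary example I picked):

var emoji = String.fromCodePoint(0x1F600); // any code point above U+FFFF
console.log(emoji.length);                     // 2
console.log(emoji.charCodeAt(0).toString(16)); // "d83d" (the lead)
console.log(emoji.charCodeAt(1).toString(16)); // "de00" (the trail)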

Answer 1:

I think your confusion is about how Unicode encodings work in general, so let me try to explain.

Unicode itself is essentially just a big list of characters in a particular order, each identified by a number called its "code point". It doesn't tell you how to convert those to bits; it just gives every character a number between 0 and 1114111 (0x10FFFF in hexadecimal). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.
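You can see that range reflected directly in ECMAScript (a small illustration I'm adding here, not part of your code):

// ECMAScript accepts any code point from 0 to 0x10FFFF
console.log(String.fromCodePoint(0x10FFFF).length); // 2 (stored as two 16-bit units)

// Anything beyond that range is rejected:
// String.fromCodePoint(0x110000); // throws RangeError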

In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.
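You can still observe that one-unit-per-code-point layout in today's engines for anything up to U+FFFF (again, just an illustration):

// Any code point up to U+FFFF fits in a single 16-bit string element
var bmp = String.fromCodePoint(0xFFFF);
console.log(bmp.length);                     // 1
console.log(bmp.charCodeAt(0).toString(16)); // "ffff"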

Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory and it's a number between 0xD800 and 0xDBFF (a "high surrogate", also called a lead surrogate), then you need to look at the next 16 bits of memory as well". Any piece of code which performs this extra check is processing its data as UTF-16, not as UCS-2.
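As a rough sketch of that rule (toSurrogatePair is a made-up name for illustration, not a built-in function):

// How UTF-16 stores a code point above U+FFFF as two 16-bit units
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;        // leaves a 20-bit value
  var lead   = 0xD800 + (offset >> 10);    // top 10 bits, giving 0xD800..0xDBFF
  var trail  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits, giving 0xDC00..0xDFFF
  return [lead, trail];
}

console.log(toSurrogatePair(0x1F600).map(function (n) { return n.toString(16); }));
// ["d83d", "de00"], the same two units String.fromCodePoint(0x1F600) produces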

It's important to understand that the memory itself doesn't "know" which encoding it's in; the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.
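JavaScript itself lets you look at the same two 16-bit units through both interpretations (the string literal below is just an example surrogate pair):

var s = '\uD83D\uDE00'; // two 16-bit units sitting in memory

// UCS-2 style interpretation: each unit is its own character code
console.log(s.charCodeAt(0), s.charCodeAt(1)); // 55357 56832

// UTF-16 interpretation: the lead surrogate tells you to read the next unit too
console.log(s.codePointAt(0)); // 128512, i.e. 0x1F600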

Now, onto Javascript...

Javascript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous