What is a “surrogate pair” in Java?

2019-01-01 14:47发布

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?

5条回答
倾城一夜雪
2楼-- · 2019-01-01 15:19

A surrogate pair is two 'code units' in UTF-16 that make up one 'code point'. The Java documentation is stating that these 'code points' will still be valid, with their 'code units' ordered correctly, after the reverse. It further states that two unpaired surrogate code units may be reversed and form a valid surrogate pair. Which means that if there are unpaired code units, then there is a chance that the reverse of the reverse may not be the same!

Notice, though, the documentation says nothing about Graphemes -- which are multiple codepoints combined. Which means e and the accent that goes along with it may still be switched, thus placing the accent before the e. Which means if there is another vowel before the e it may get the accent that was on the e.

Yikes!

查看更多
若你有天会懂
3楼-- · 2019-01-01 15:22

The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.

查看更多
千与千寻千般痛.
4楼-- · 2019-01-01 15:29

Surrogate pairs refer to UTF-16's way of encoding certain characters, see http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000..U.2B10FFFF

查看更多
深知你不懂我心
5楼-- · 2019-01-01 15:40

Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values — called code points — were adopted for the UTF-32 encoding scheme. But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates(in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates(in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate — a surrogate pair — to represent (the product of 1,024 and 1,024)1,048,576 (0x100000) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF) .

查看更多
不流泪的眼
6楼-- · 2019-01-01 15:42

What that documentation is saying is that invalid UTF-16 strings may become valid after calling the reverse method since they might be the reverses of valid strings. A surrogate pair (discussed here) is a pair of 16-bit values in UTF-16 that encode a single Unicode code point; the low and high surrogates are the two halves of that encoding.

查看更多
登录 后发表回答