How are surrogate pairs calculated?

If a unicode uses 17 bits codepoints, how the surrogate pairs is calculated from code points?

标签： unicode utf-8

2条回答

2楼-- · 2019-05-31 17:53

If it is code you are after, here is how a single codepoint is encoded in UTF-16 and UTF-8 respectively.

A single codepoint to UTF-16 codeunits:

if (cp < 0x10000u)
{
   *out++ = static_cast<uint16_t>(cp);
}
else
{
   *out++ = static_cast<uint16_t>(0xd800u + (((cp - 0x10000u) >> 10) & 0x3ffu));
   *out++ = static_cast<uint16_t>(0xdc00u + ((cp - 0x10000u) & 0x3ffu));
}

A single codepoint to UTF-8 codeunits:

if (cp < 0x80u)
{
   *out++ = static_cast<uint8_t>(cp);
}
else if (cp < 0x800u)
{
   *out++ = static_cast<uint8_t>((cp >> 6) & 0x1fu | 0xc0u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else if (cp < 0x10000u)
{
   *out++ = static_cast<uint8_t>((cp >> 12) & 0x0fu | 0xe0u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else
{
   *out++ = static_cast<uint8_t>((cp >> 18) & 0x07u | 0xf0u);
   *out++ = static_cast<uint8_t>(((cp >> 12) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}

0人赞添加讨论(0) 举报

再贱就再见

3楼-- · 2019-05-31 18:02

Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are are 21 bit integers, not 17 bit.

Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.

Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
Scalar values from 0x00D800 to 0x00DFFF are not characters in Unicode, and so they will never occur in a Unicode character string.
Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit encodes the upper 11 bits of the scalar value, as a code unit ranging from 0xD800-0xDBFF. There's a bit of trickiness to encode values from 0x01-0x10 in four bits. The second code unit encodes the lower 10 bits of the scalar value, as a code unit ranging from 0xDC00-0xDFFF.

This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.

0人赞添加讨论(0) 举报

How are surrogate pairs calculated?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间