Are UTF16 (as used by for example wide-winapi functions) characters always 2 bytes long?

Posted 2019-03-30 02:02

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)
  • Most of MSDN and some other documentation seem to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way.
  • There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed.
  • To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.
  • UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2? And then, for example, if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

UPDATE: Now I see that character-counting is not necessarily a standard thing, or even a C++ thing, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:

On Windows, specifically, in Winapi, in their wide functions (ending with W), how does one count the number of characters in a string that consists of 2 Unicode codepoints, each consisting of 2 code units (a total of 8 bytes)? Is such a string 2 characters long (the same as the number of codepoints) or 4 characters long (the same as the total number of code units)?

Or, being more generic: what does the Windows definition of "number of characters in a wide string" mean, the number of codepoints or the number of code units?

8 answers
ら.Afraid
#2 · 2019-03-30 02:30

All characters in the Basic Multilingual Plane will be 2 bytes long.

Characters in other planes will be encoded into 4 bytes each, in the form of a surrogate pair.

Obviously, if a function does not try to detect surrogate pairs and blindly treats each pair of bytes as a character, it will bug out on strings that contain such pairs.
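As a sketch of what "detecting surrogate pairs" means in practice, here is a minimal portable C++ version, using char16_t in place of Windows' 2-byte WCHAR; the helper names are my own, not any Winapi function:

```cpp
#include <cstddef>
#include <string_view>

// A code unit in 0xD800..0xDBFF is a high (leading) surrogate,
// one in 0xDC00..0xDFFF is a low (trailing) surrogate.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Count codepoints: a valid high+low surrogate pair counts as one codepoint.
std::size_t count_codepoints(std::u16string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n)
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1]))
            ++i;  // skip the trailing surrogate of this pair
    return n;
}
```

For a string like u"A\U0001D11E" ('A' plus musical symbol U+1D11E), the code-unit count is 3 but the codepoint count is 2; a function that blindly treats each unit as a character reports 3.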

太酷不给撩
#3 · 2019-03-30 02:34

Windows' WCHAR is 16 bits (2 bytes) long.

A Unicode codepoint may be represented by one or two of these WCHAR – 16 or 32 bits (2 or 4 bytes).

wcslen returns the number of WCHAR units in a wide string, not the number of codepoints; to count codepoints you have to scan for surrogate pairs yourself. Obviously, the codepoint count is always <= the WCHAR count.

A Unicode character may consist of multiple combining codepoints.
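To make the combining-codepoint point concrete, here is a small illustration (my own example, not from the answer): "é" can be written as a base letter plus a combining accent, so even counting codepoints does not give you user-perceived characters.

```cpp
#include <string>

// "é" written as base letter 'e' followed by combining acute accent U+0301:
// two codepoints (and two UTF-16 code units), but one user-perceived character.
const std::u16string decomposed = u"e\u0301";

// The precomposed form U+00E9 is a single codepoint and a single code unit.
const std::u16string precomposed = u"\u00E9";
```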

ら.Afraid
#4 · 2019-03-30 02:35

There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously)

Well, WCHAR is an MS thing, not a C++ thing.
But there is wchar_t for wide characters, though it is not always 2 bytes; on Linux systems it is usually 4 bytes.

Most of MSDN and some other documentation seem to assume that the characters are always 2 bytes long.

Do they? I can believe it.

There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed.

C/C++ make no assumption about character encoding, though the OS can. For example, Windows uses UTF-16 at its interfaces, while a lot of Linux systems use UTF-32 (for wchar_t). But you need to read the documentation for each interface to know explicitly.

To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.

2 bytes is all you need for numbers 0 -> 65535

But Unicode's code space needs about 21 bits per code point (it runs up to 0x10FFFF). Thus some code points are encoded as two 16-bit code units in UTF-16 (these are referred to as surrogate pairs).
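The surrogate-pair encoding itself is simple arithmetic: subtract 0x10000, leaving a 20-bit offset, and split it into two 10-bit halves. A sketch (the function name is mine):

```cpp
#include <cstdint>
#include <utility>

// Encode a supplementary-plane codepoint (> 0xFFFF) as a UTF-16 surrogate pair.
// Subtracting 0x10000 first leaves a 20-bit value, which is split into two
// 10-bit halves carried by the high (0xD800-based) and low (0xDC00-based) units.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
    std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}
```

For example, U+1D11E (musical symbol G clef) encodes as the pair 0xD834, 0xDD1E.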

UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

UTF-8, UTF-16 and UTF-32 all encode the same set of code points (about 21 bits per code point). UTF-32 is the only one with a fixed size per code point. UTF-16 was originally supposed to be fixed size, but far more characters needed encoding than fit in plane 0, so 16 more planes were added (17 in total), and code points beyond plane 0 take two 16-bit units.

So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2?

It is either one 16-bit code unit (2 bytes) or two 16-bit code units (4 bytes).

And then for example if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

You have to step along the string and count one codepoint at a time, checking each unit for surrogate pairs.
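Stepping along like that can be sketched as follows; this walk also decodes each codepoint back out, advancing by one or two units per step (portable C++, my own helper name):

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Walk a UTF-16 string one codepoint at a time, advancing by one or two
// code units per step, and collect the decoded codepoints.
std::vector<char32_t> decode_utf16(std::u16string_view s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Surrogate pair: recombine the two 10-bit halves and re-add 0x10000.
            out.push_back(0x10000 + ((char32_t(u) - 0xD800) << 10)
                                  + (char32_t(s[i + 1]) - 0xDC00));
            i += 2;
        } else {
            out.push_back(u);
            i += 1;
        }
    }
    return out;
}
```

The size of the returned vector is the codepoint count, which for strings containing surrogate pairs is smaller than s.size().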

Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

That depends on what is being counted: Winapi "character" counts are in WCHAR code units, so such a string is 4 long, even though it holds only 2 codepoints.

唯我独甜
#5 · 2019-03-30 02:35

This Wikipedia article seems to be a good intro.

UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
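The 1,112,064 figure can be checked with a line of arithmetic: the code space 0..0x10FFFF holds 0x110000 values, minus the 2048 surrogate values that UTF-16 reserves for itself.

```cpp
// The Unicode code space runs from 0 to 0x10FFFF (0x110000 values), but the
// 2048 surrogate values 0xD800..0xDFFF are reserved for the UTF-16 encoding
// mechanism and are not valid codepoints, leaving the scalar values below.
constexpr int total_values     = 0x110000;
constexpr int surrogate_values = 0x800;     // 0xDFFF - 0xD800 + 1
constexpr int scalar_values    = total_values - surrogate_values;
static_assert(scalar_values == 1112064, "matches the figure quoted above");
```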

贼婆χ
#6 · 2019-03-30 02:36

According to the Unicode FAQ it could be

one or two 16-bit code units

Windows uses 16-bit chars, probably because Unicode was originally 16-bit. So you don't have an exact map, but you might be able to get away with treating every string you see as containing just 16-bit Unicode characters.

叼着烟拽天下
#7 · 2019-03-30 02:39

Short story: UTF-16 is a variable-length encoding. A single character may be one or two widechars long.

HOWEVER, you may very well get away with treating it as a fixed-length encoding where every character is one widechar (2 bytes). This is formally called UCS-2, and it was Win32's assumption until Windows 2000 introduced surrogate support. The UCS-2 range covers practically all living, dead and constructed human languages. And truth be told, working with variable-length encoded strings just sucks: iteration becomes an O(n) operation, string length is not the same as string size, and any sensible parsing becomes a pain.
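As a quick illustration of where the fixed-length assumption breaks (my own example, using portable char16_t in place of WCHAR): a single emoji outside the BMP occupies two code units, so the "one widechar per character" count is off by one.

```cpp
#include <string>

// U+1F600 (grinning face) lies outside the BMP, so in UTF-16 it takes a
// surrogate pair: one user-perceived character, two 16-bit code units.
const std::u16string grinning = u"\U0001F600";

// A BMP character such as 'A' really is one code unit.
const std::u16string letter = u"A";
```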

As for the UTF-16 characters that are not in UCS-2... I only know two subsets that may realistically come up. The first is emoji, the graphical smileys popular in Japanese cell phone culture; on iPhone there are third-party apps that enable input of them, but outside of those phones they don't display properly. The other class is very obscure Chinese characters, ones even most Chinese speakers don't know; all the popular Chinese characters are well inside UCS-2.
