Question:
Here are some excerpts from my copy of the 2014 draft standard N4140
22.5 Standard code conversion facets [locale.stdcvt]
3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4. I don't like it much because, if it's true, we have an important property of the language buried deep in a library description. I have tried to find a more direct statement of this property, but to no avail.
Another interpretation is that the wchar_t encoding is not required to be either UCS2 or UCS4, and on implementations where it isn't, codecvt_utf8 won't work for wchar_t. I don't like this interpretation much either, because if it's true, and neither the char nor the wchar_t native encoding is Unicode, there doesn't seem to be a way to portably convert between those native encodings and Unicode.
Which of the two interpretations is true? Is there another one which I overlooked?
Clarification 1. I'm not asking for general opinions about the suitability of wchar_t for software development, or for properties of wchar_t one can derive from elsewhere. I am interested in these two specific paragraphs of the standard. I'm trying to understand what these specific paragraphs entail or do not entail.
Clarification 2. If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem. It doesn't. It says what it says. It appears that if one uses std::codecvt_utf8<wchar_t>, one ends up with a bunch of wchar_t values encoded as UCS2 or UCS4, regardless of the current global locale. (There is no way to specify a locale or any character conversion facet for codecvt_utf8.) So the question can be rephrased like this: is the conversion result directly usable with the current global locale (and/or with any possible locale) for output, wctype queries and so on? If not, what is it usable for? (If the second interpretation above is correct, the answer would seem to be "nothing".)
Answer 1:
wchar_t is just an integral type. It has a min value, a max value, etc.
Its size is not fixed by the standard.
If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.
In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.
22.5 Standard code conversion facets [locale.stdcvt]
3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It does conversion.
The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.
This does not mean that wchar_t is encoded as such; it just means this operation interprets the wchar_t as being encoded as such.
How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from I/O. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. That is not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.
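As a minimal sketch of that point, the helpers below (ucs4_to_utf8 and utf8_to_ucs4 are names invented for this illustration) feed codecvt_utf8<char32_t> values that were simply written down as integers. It uses char32_t rather than wchar_t so the element size does not vary between platforms, and it relies on std::wstring_convert and <codecvt>, which are deprecated since C++17.

```cpp
#include <cassert>
#include <codecvt>  // deprecated since C++17, but it is what the quoted clause describes
#include <locale>
#include <string>

// Hypothetical helper: the facet treats each char32_t as a UCS-4 value,
// however that value ended up in the string, and encodes it as UTF-8.
std::string ucs4_to_utf8(const std::u32string& code_points) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(code_points);
}

// Hypothetical helper for the opposite direction: UTF-8 bytes to UCS-4 values.
std::u32string utf8_to_ucs4(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}
```

Passing a one-element string holding the plain integer 0x20AC to ucs4_to_utf8 should give the three bytes E2 82 AC; no locale is consulted at any point.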
Similar claims hold in other cases. This does not mandate what format wchar_t has. It simply states how these facets interpret wchar_t or char16_t or char32_t or char8_t (reading or writing).
Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.
iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color from out of space.
To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.
The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.
Answer 2:
No.
wchar_t is only required to be able to represent the largest extended character set among the locales supported by the implementation, which could theoretically fit in a char.
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).
— C++ [basic.fundamental] 3.9.1/5
As such, it is not even required to support Unicode:
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.
ISO/IEC 10646:2003 Unicode standard 4.0
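Since the width is implementation-specific, a program can at least inspect it. The following is a rough sketch; wchar_bits and wchar_can_hold are names chosen for this example, not standard facilities.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Number of bits in the object representation of wchar_t on this implementation.
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// True if the integer value code_point fits in the value range of wchar_t.
// This says nothing about how any locale interprets that value.
constexpr bool wchar_can_hold(unsigned long code_point) {
    return code_point <=
           static_cast<unsigned long>(std::numeric_limits<wchar_t>::max());
}
```

For instance, wchar_can_hold(0x10FFFF) tells you whether every Unicode code point fits numerically in a single wchar_t, which is a weaker property than wchar_t "being Unicode".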
Answer 3:
Let us differentiate between wchar_t and string literals built using the L prefix.
wchar_t is just an integer type, which may be larger than char.
String literals using the L prefix will generate strings using wchar_t characters. Exactly what that means is implementation-dependent. There is no requirement that such literals use any particular encoding. They might use UTF-16, UTF-32, or something else that has nothing to do with Unicode at all.
So if you want a string literal which is guaranteed to be encoded in a Unicode format, across all platforms, use the u8, u, or U prefixes for the string literal.
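The difference can be checked at compile time. This sketch counts code units in a few literals; the constant names are made up for the illustration, and the test characters U+20AC and U+1F600 are arbitrary choices.

```cpp
#include <cassert>
#include <cstddef>

// sizeof counts the terminating null element, so subtract one element each time.
constexpr std::size_t utf8_units  = sizeof(u8"\u20AC") - 1;
constexpr std::size_t utf16_units = sizeof(u"\U0001F600") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_units = sizeof(U"\U0001F600") / sizeof(char32_t) - 1;
constexpr std::size_t wide_units  = sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;

static_assert(utf8_units == 3, "U+20AC takes three UTF-8 code units");
static_assert(utf16_units == 2, "U+1F600 needs a UTF-16 surrogate pair");
static_assert(utf32_units == 1, "U+1F600 is one UTF-32 code unit");
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "the surrogate pair for U+1F600");
// wide_units is deliberately not asserted: the encoding of L"..." literals is
// up to the implementation.
```

The three prefixed literals have fixed sizes on every conforming implementation, while wide_units may differ from one platform to another.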
One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4.
No, that is not a valid interpretation. wchar_t has no encoding; it's just a type. It is data which is encoded. A string literal prefixed by L may or may not be encoded in UCS2 or UCS4.
If you provide codecvt_utf8 a string of wchar_ts which are encoded in UCS2 or UCS4 (as appropriate to sizeof(wchar_t)), then it will work. But not because of wchar_t; it only works because the data you provide it is correctly encoded.
If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem.
The whole point of those codecvt_* facets is to perform locale-independent conversions. If you want locale-dependent conversions, you shouldn't use them. You should instead use the global codecvt facet.
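For comparison, a locale-dependent conversion might look roughly like the sketch below. It fetches the codecvt<wchar_t, char, mbstate_t> facet from a locale object and lets it decide how the bytes map to wide characters. The function name widen_with_locale and its error handling are assumptions made for this example.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts narrow multibyte text to wide characters using
// whatever encoding pair the given locale's codecvt facet implements.
std::wstring widen_with_locale(const std::string& bytes, const std::locale& loc) {
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    if (bytes.empty()) return std::wstring();

    const Facet& cvt = std::use_facet<Facet>(loc);
    std::mbstate_t state{};
    // Each wide character consumes at least one byte, so this buffer is large enough.
    std::wstring out(bytes.size(), L'\0');
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    const auto result = cvt.in(state, bytes.data(), bytes.data() + bytes.size(),
                               from_next, &out[0], &out[0] + out.size(), to_next);
    if (result == std::codecvt_base::noconv) {
        // The facet performs no conversion; copy the characters as they are.
        return std::wstring(bytes.begin(), bytes.end());
    }
    if (result != std::codecvt_base::ok) {
        throw std::runtime_error("input is not valid in this locale's encoding");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Passing std::locale() uses the current global locale; passing std::locale::classic() uses the "C" locale, where plain ASCII such as "abc" widens to L"abc".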
Answer 4:
It appears your first conclusion is shared by Microsoft, which enumerates the possible options and notes that UTF-16, although "widely used as such[sic]", is not a valid encoding.
The same wording is also used by QNX, which points at the source of the wording: Both QNX and Microsoft derive their Standard Library implementation from Dinkumware.
Now, as it happens, Dinkumware is also the author of N2401 which introduced these classes. So I'm going to side with them.
Answer 5:
As Elem can be wchar_t, char16_t, or char32_t, clause 4.1 says nothing about a required wchar_t encoding. It states something about the conversion performed.
From the wording, it is clear that the conversion is between UTF-8 and either UCS-2 or UCS-4, depending on the size of Elem. So if wchar_t is 16 bits, the conversion will be with UCS-2, and if it is 32 bits, UCS-4.
Why does the standard mention UCS-2 and UCS-4 and not UTF-16 and UTF-32? Because codecvt_utf8 will convert a multi-byte UTF-8 sequence to a single wide character:
- UCS-2 is a subset of Unicode; unlike UTF-16, it has no surrogate-pair encoding
- UCS-4 is currently the same as UTF-32 (but looking at the growing number of emojis, maybe one day 32 bits will not be enough, and there would be a UTF-64 and UTF-32 surrogate pairs that would not be supported by codecvt_utf8)
However, it is not clear to me what happens if a UTF-8 text contains a sequence that corresponds to a Unicode character not available in UCS-2, when the receiving type is char16_t.
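One way to find out what a given implementation does in that situation is to try it. The sketch below relies on the documented behaviour that std::wstring_convert::from_bytes throws std::range_error when the conversion fails. The name converts_to_ucs2 is invented here, and the result for characters outside the BMP is something to observe on your implementation rather than a guarantee.

```cpp
#include <cassert>
#include <codecvt>  // deprecated since C++17
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical probe: returns true if codecvt_utf8<char16_t> accepted the whole
// UTF-8 input, false if wstring_convert reported a conversion failure.
bool converts_to_ucs2(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> converter;
    try {
        converter.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

Calling it with "\xE2\x82\xAC" (U+20AC, inside the BMP) and then with "\xF0\x9F\x98\x80" (U+1F600, outside it) shows how the library at hand treats a character that UCS-2 cannot represent.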
Answer 6:
Both your interpretations are incorrect. The standard doesn't require that there be a single wchar_t encoding, just like it doesn't require a single char encoding. The codecvt_utf8 facet must convert between UTF-8 and UCS-2 or UCS-4.
This is true even if UTF-8, UCS-2, and UCS-4 are not supported as character sets in any locale.
If Elem is of type wchar_t and isn't big enough to store a UCS-2 value, then the conversion operations of the codecvt_utf8 facet are undefined, because the standard doesn't say what happens in that case. If it is big enough (or if you want to argue that the standard requires that it must be big enough), then it's merely implementation-defined whether the UCS-2 or UCS-4 wchar_t values the facet generates or consumes are in an encoding compatible with any locale-defined wchar_t encoding.
Answer 7:
The first interpretation is conditionally true.
If the __STDC_ISO_10646__ macro (imported from C) is defined, then wchar_t is a superset of some version of Unicode.
__STDC_ISO_10646__
An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.
It appears that if the macro is defined, some kind of UCS4 can be assumed. (Not UCS2 as ISO 10646 never had a 16-bit version; the first release of ISO 10646 corresponds to Unicode 2.0).
So if the macro is defined, then
- there is a "native" wchar_t encoding
- it is a superset of some version of UCS4
- the conversion provided by codecvt_utf8<wchar_t> is compatible with this native encoding
None of these things are required to hold if the macro is not defined.
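A program can test for the macro before relying on these properties. A minimal sketch, where wchar_t_is_iso10646 is a name made up for this example:

```cpp
#include <cassert>

// Hypothetical helper: reports whether the implementation promises that wchar_t
// values of ISO/IEC 10646 characters equal their code points.
constexpr bool wchar_t_is_iso10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

When it returns true, a check such as L'\u20AC' == 0x20AC is expected to hold; when it returns false, nothing can be concluded about the numeric values stored in a wchar_t.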
There are also __STDC_UTF_16__ and __STDC_UTF_32__, but the C++ standard doesn't say what they mean. The C standard says that they signify UTF-16 and UTF-32 encodings for char16_t and char32_t respectively, but in C++ these encodings are always used.
Incidentally, the functions mbrtoc32 and c32rtomb convert back and forth between char sequences and char32_t sequences. In C they only use UTF-32 if __STDC_UTF_32__ is defined, but in C++ UTF-32 is always used for char32_t. So it would appear that even if __STDC_ISO_10646__ is not defined, it should be possible to convert between UTF-8 and wchar_t by going from UTF-8 to UTF-32-encoded char32_t to natively encoded char to natively encoded wchar_t, but I'm afraid of this complex stuff.