From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char str[] = "Test String";
or this?
const char str[] = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII, plus something that may depend on the environment's settings for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded in UTF-8, however, you are probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings, a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly no longer true for a UTF-8 string, where each character object just represents one byte of some character. As a result, the string manipulation functions, character classification functions, etc. won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
If the execution character set of the compiler is set to UTF-8, it makes no difference if
u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
The encoding of "Test String"
is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as
\U0010FFFF) into the string, then in the u8 string you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if so what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥8, due to the range required for char in the C standard. Which is (quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I had to guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
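The CHAR_BIT guarantee is easy to check at compile time. A minimal sketch, assuming any C++11 or later compiler; the helper name char_bits is made up for this illustration:

```cpp
#include <climits>   // CHAR_BIT, UCHAR_MAX

// CHAR_BIT is the number of bits in a char. The C standard requires it to be
// at least 8, and C++ inherits that requirement.
static_assert(CHAR_BIT >= 8, "char must have at least 8 bits");

// Consequently unsigned char covers at least 0..255, which is enough to hold
// any single UTF-8 code unit.
static_assert(UCHAR_MAX >= 255, "unsigned char must cover 0..255");

// Hypothetical helper, only here so the value can be inspected or printed.
constexpr int char_bits() { return CHAR_BIT; }
```

If either static_assert fired, the implementation would not be conforming in the first place, which is why the Wikipedia wording does not describe a real change.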
Regarding the effect of the u8 literal prefix, it affects the encoding of the string in the executable, but unfortunately it does not affect the type.
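To see what "does not affect the type" means in practice, here is a small sketch. It assumes a C++11 or later compiler; element_of, plain_elem, utf8_elem, has_char8_t and same_element_type are names made up for this illustration. One caveat: from C++20 on, u8 literals have element type char8_t rather than char, which the sketch detects through the __cpp_char8_t feature-test macro.

```cpp
#include <type_traits>

// Element type of a string literal: decltype of a literal is a reference to a
// const array, so strip the reference, the array extent and the const.
template <typename Literal>
struct element_of {
    using type = typename std::remove_cv<
        typename std::remove_extent<
            typename std::remove_reference<Literal>::type>::type>::type;
};

using plain_elem = element_of<decltype("abc")>::type;
using utf8_elem  = element_of<decltype(u8"abc")>::type;

// An ordinary narrow literal is always an array of const char.
static_assert(std::is_same<plain_elem, char>::value, "plain literal uses char");

#if defined(__cpp_char8_t)
// C++20 and later (or compilers with char8_t enabled): u8 literals use char8_t.
constexpr bool has_char8_t = true;
static_assert(std::is_same<utf8_elem, char8_t>::value, "u8 literal uses char8_t");
#else
// C++11/14/17: the u8 prefix changes the encoding but not the type.
constexpr bool has_char8_t = false;
static_assert(std::is_same<utf8_elem, char>::value, "u8 literal uses char");
#endif

// True exactly when both kinds of literal share the same element type.
constexpr bool same_element_type = std::is_same<plain_elem, utf8_elem>::value;
```

Under C++11 through C++17, same_element_type is true, so overload resolution and templates cannot tell the two kinds of literal apart.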
Thus, in both cases, "tørrfisk" and u8"tørrfisk", you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for an array size of 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” is encoded with 2 bytes, for an array size of 10.
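The array sizes can be checked with sizeof. A minimal sketch, assuming a C++11 or later compiler; the names utf8_size, plain_size and utf8_byte are made up for this illustration, and "ø" is written as the escape \u00F8 so the result does not depend on how the source file itself is saved:

```cpp
#include <cstddef>   // std::size_t

// u8 literal: always UTF-8. "ø" (U+00F8) becomes the two bytes 0xC3 0xB8, so
// the array holds 7 one-byte letters + 2 bytes + the terminating null = 10.
constexpr std::size_t utf8_size = sizeof(u8"t\u00F8rrfisk");
static_assert(utf8_size == 10, "UTF-8 array size is 10");

// Ordinary literal: the size depends on the compiler's execution character
// set (for example 9 with Latin 1, 10 with UTF-8), so nothing is asserted.
constexpr std::size_t plain_size = sizeof("t\u00F8rrfisk");

// Hypothetical helper: returns the i-th byte of the UTF-8 array.
// The caller must keep i below utf8_size.
constexpr unsigned char utf8_byte(std::size_t i) {
    return static_cast<unsigned char>(u8"t\u00F8rrfisk"[i]);
}

static_assert(utf8_byte(1) == 0xC3, "first byte of the encoded character");
static_assert(utf8_byte(2) == 0xB8, "second byte of the encoded character");
```

Printing plain_size on your own toolchain shows which execution character set the compiler picked for ordinary literals.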