#include <iostream>

int main() {
std::cout << "\u2654" << std::endl; // Result #1: ♔
std::cout << U'\u2654' << std::endl; // Result #2: 9812
std::cout << U'♔' << std::endl; // Result #3: 9812
return 0;
}
I am having trouble understanding how Unicode works in C++. Why doesn't the literal print the character in the terminal?
I kind of want something like this to work:
char32_t txt_representation() { return /* Unicode codepoint */; }
Note: the source is UTF-8 and so is the terminal, sitting on macOS Sierra, CLion.
Unicode and C++
There are several Unicode encodings, each matched by a character type:

char (UTF-8)
char16_t (UTF-16)
char32_t (UTF-32)

Here is an excellent video tutorial on Unicode with C++ by James McNellis. He explains everything you need to know about character set encodings, about Unicode and its different encodings, and how to use them in C++.
Your code

"\u2654" is a narrow string literal that has the type array of char. The white chess king Unicode character is encoded as 3 consecutive chars corresponding to its UTF-8 encoding ({ 0xe2, 0x99, 0x94 }). As we are in a string, there is no problem with having several chars in it. As your console locale certainly uses UTF-8, it will correctly decode the sequence when the string is displayed.

U'\u2654' is a character literal of type char32_t (because of the uppercase U). As it is a char32_t (and not a char), it is not displayed as a character but as an integer value. The value in decimal is 9812. Had you used hex, you would have recognized it immediately.

The last one, U'♔', obeys the same logic. Be aware, however, that you are embedding a Unicode character in the source code. This is fine as long as the editor's character encoding matches the source encoding expected by the compiler, but it could cause mismatches if the file were copied (without conversion) to an environment expecting a different encoding.

C++ doesn't really have the concept of "character" in its type system.
char, wchar_t, char16_t, and char32_t are all considered to be kinds of integer. As a consequence, character literals like 'x', L'x', and U'x' are all numbers. There is an operator<< specifically for char, which is why cout << 'x' does the same thing as cout << "x", but there are no analogues for wchar_t, char16_t, or char32_t, so your wide character literals are being silently converted to int and printed as such. I personally never use iostreams, so I don't actually know how to persuade operator<< to print a number as its Unicode codepoint, but there's probably some way to do it.

There's a stronger distinction between "string" and "array of integers" in the type system, so you do get the output you expect when you supply a string literal. Note, however, that cout << L"♔" won't give the output you expect, and cout << "♔" isn't even guaranteed to compile. cout << u8"♔" will work on a C++11-compliant system where the narrow character encoding is in fact UTF-8, but will probably produce mojibake if the character encoding is something else.

(Yes, this is all much more complicated and less useful than it has any excuse to be. This is partially because of backward compatibility constraints inherited from C, partially because it was all designed back in the 1990s, before Unicode took over the world, and partially because many of the design errors in the C++ string and stream classes were not apparent as errors until it was too late to fix them.)
Printing wide characters to narrow streams is not supported and doesn't work at all. (It "works" but the result is not what you want).
Printing multibyte narrow strings to wide streams is not supported and doesn't work at all. (It "works" but the result is not what you want).
On a Unicode-ready system, std::cout << "\u2654" works as expected. So does std::cout << u8"\u2654". Most properly set up Unix-based operating systems are Unicode-ready.

On a Unicode-ready system, std::wcout << L'\u2654' should work as expected if you set up your program's locale properly. This is done with this call:

std::locale::global(std::locale(""));

or by imbuing the stream with that locale via std::wcout.imbue().
Note "should"; with some compilers/libraries this method may not work at all. It's a deficiency with these compilers/libraries. I'm looking at you, libc++. It may or may not officially be a bug, but I view it as a bug.
You should really set up your locale in all programs that wish to work with Unicode, even if this doesn't appear necessary.
Mixing cout and wcout in the same program does not work and is not supported.

std::wcout << U'\u2654' does not work because it mixes a wchar_t stream with a char32_t character. wchar_t and char32_t are different types. I guess a properly set up std::basic_ostream<char32_t> would work with char32_t strings, but the standard library doesn't provide any.

char32_t-based strings are good for storing and processing Unicode code points. Do not use them for formatted input and output directly. std::wstring_convert can be used to convert them back and forth.

TL;DR: work with either std::ostreams and std::strings, or (if you are not on libc++) std::wostreams and std::wstrings.

On my system I can't mix using std::cout with std::wcout and get sensible results, so you have to do these separately.

You should set the locale to that of the native system using std::locale::global(std::locale(""));

Also use wide streams for the second two outputs.
Either keep everything narrow, printing UTF-8 strings through std::cout, or go wide, printing wchar_t characters and strings through std::wcout.

That should encourage the output streams to convert between the local system's encoding and either UTF-8 (1st approach) or UTF-16/UTF-32 (2nd approach). I think, to be safest with the first approach (editors can have other encodings), it is best to prefix the string with u8.