#include <iostream>

int main() {
std::cout << "\u2654" << std::endl; // Result #1: ♔
std::cout << U'\u2654' << std::endl; // Result #2: 9812
std::cout << U'♔' << std::endl; // Result #3: 9812
return 0;
}
I am having trouble understanding how Unicode works in C++. Why doesn't the literal print the character in the terminal?
I kind of want something like this to work:
char32_t txt_representation() { return /* Unicode codepoint */; }
Note: the source is UTF-8 and so is the terminal, sitting on macOS Sierra, CLion.
Unicode and C++
There are several Unicode encodings, each matched by a character type:

char (UTF-8)
char16_t (UTF-16)
char32_t (UTF-32)

Here is an excellent video tutorial on Unicode with C++ by James McNellis. He explains everything you need to know about character set encodings, about Unicode and its different encodings, and how to use them in C++.
Your code

"\u2654" is a narrow string literal that has the type array of char. The white chess king Unicode character is encoded as 3 consecutive chars corresponding to its UTF-8 encoding ({ 0xe2, 0x99, 0x94 }). As we are in a string, there is no problem with having several chars in it. As your console locale certainly uses UTF-8, it will correctly decode the sequence when the string is displayed.

U'\u2654' is a character literal of type char32_t (because of the uppercase U). As it is a char32_t (and not a char), it is not displayed as a character but as an integer value. The value in decimal is 9812. Had you used hex, you would have recognized it immediately.

The last one, U'♔', obeys the same logic. Be aware, however, that you are embedding a Unicode character in the source code. This is fine as long as the editor's character encoding matches the source encoding expected by the compiler, but it could cause mismatches if the file were copied (without conversion) to an environment expecting a different encoding.

C++ doesn't really have the concept of "character" in its type system.
char, wchar_t, char16_t, and char32_t are all considered to be kinds of integer. As a consequence, character literals like 'x', L'x', and U'x' are all numbers. There is an operator<< specifically for char, which is why cout << 'x' does the same thing as cout << "x", but there are no analogues for wchar_t, char16_t, or char32_t, so your wide character literals are being silently converted to int and printed as such. I personally never use iostreams, so I don't actually know how to persuade operator<< to print a number as its Unicode codepoint, but there's probably some way to do it.

There's a stronger distinction between "string" and "array of integers" in the type system, so you do get the output you expect when you supply a string literal. Note, however, that cout << L"♔" won't give the output you expect, and cout << "♔" isn't even guaranteed to compile. cout << u8"♔" will work on a C++11-compliant system where the narrow character encoding is in fact UTF-8, but will probably produce mojibake if the character encoding is something else.

(Yes, this is all much more complicated and less useful than it has any excuse to be. This is partially because of backward compatibility constraints inherited from C, partially because it was all designed back in the 1990s, before Unicode took over the world, and partially because many of the design errors in the C++ string and stream classes were not apparent as errors until it was too late to fix them.)
Printing wide characters to narrow streams is not supported and doesn't work at all. (It "works" but the result is not what you want).
Printing multibyte narrow strings to wide streams is not supported and doesn't work at all. (It "works" but the result is not what you want).
On a Unicode-ready system, std::cout << "\u2654" works as expected. So does std::cout << u8"\u2654". Most properly set up Unix-based operating systems are Unicode-ready.

On a Unicode-ready system, std::wcout << L'\u2654' should work as expected if you set up your program's locale properly. This is done with this call:

std::locale::global(std::locale(""));

or by imbuing the stream with that locale via std::wcout.imbue().
Note "should"; with some compilers/libraries this method may not work at all. It's a deficiency with these compilers/libraries. I'm looking at you, libc++. It may or may not officially be a bug, but I view it as a bug.
You should really set up your locale in all programs that wish to work with Unicode, even if this doesn't appear necessary.
Mixing cout and wcout in the same program does not work and is not supported.

std::wcout << U'\u2654' does not work because it mixes a wchar_t stream with a char32_t character. wchar_t and char32_t are different types. I guess a properly set up std::basic_ostream<char32_t> would work with char32_t strings, but the standard library doesn't provide any.

char32_t-based strings are good for storing and processing Unicode code points. Do not use them for formatted input and output directly. std::wstring_convert can be used to convert them back and forth.

TL;DR: work with either std::ostreams and std::strings, or (if you are not on libc++) std::wostreams and std::wstrings.

On my system I can't mix using std::cout with std::wcout and get sensible results, so you have to do these separately.

You should set the locale to that of the native system using std::locale::global(std::locale(""));

Also use wide streams for the second two outputs.
Either keep everything narrow, printing UTF-8 strings through std::cout, or go wide, printing wchar_t characters and strings through std::wcout.

That should encourage the output streams to convert between the local system's encoding and either UTF-8 (1st approach) or UTF-16/UTF-32 (2nd approach). I think, to be safest with the first approach (editors can have other encodings), it is best to prefix the string with u8.