How well is Unicode supported in C++11?

Posted 2019-01-01 06:36

I've read and heard that C++11 supports Unicode. A few questions on that:

  • How well does the C++ standard library support Unicode?
  • Does std::string do what it should?
  • How do I use it?
  • Where are potential problems?

5 Answers
姐姐魅力值爆表
#2 · 2019-01-01 07:08

There is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap left by the still-missing UTF-8 string container class.

This might be the most comfortable way of 'dealing' with UTF-8 strings (that is, without Unicode normalization and similar stuff). You comfortably operate on code points, while your string stays stored as variable-length-encoded chars.

与君花间醉酒
#3 · 2019-01-01 07:09

You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), because a Unicode NUL (U+0000) is a null byte in UTF-8 and this is the only way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated as far as all of the C and C++ string functions are concerned, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).

What you cannot do with std::string for UTF-8 is get length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
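A rough sketch of how the code point count can still be computed by hand: every UTF-8 code point starts with exactly one byte that is not a continuation byte (continuation bytes look like 10xxxxxx), so counting those bytes gives the count. The helper name count_code_points is made up for this illustration, and the sketch assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting every
// byte that is NOT a continuation byte (bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the four-byte sequence "\xF0\x9F\x8D\x8C" (U+1F34C), size() reports 4 while count_code_points reports 1. For pure ASCII such as "abc", both report 3.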

If you need to operate on UTF-8 strings at the code point level (not just store and print them), or if you're dealing with UTF-16, which is likely to contain many internal null bytes, you need to look into the wide character string types, such as std::wstring or C++11's std::u16string and std::u32string.
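To see why UTF-16 does not fit into a char-based C string, here is a small sketch that serializes UTF-16 code units to little-endian bytes and then asks strlen for the length. Both helper names are invented for this example; the byte order is an arbitrary choice for the demonstration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: writes each UTF-16 code unit as two bytes, low byte first.
std::string to_utf16le_bytes(const std::u16string& text)
{
    std::string bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));
    }
    return bytes;
}

// Length as seen by C string functions, which stop at the first null byte.
std::size_t c_string_length(const std::string& bytes)
{
    return std::strlen(bytes.c_str());
}
```

For u"Hi" the serialized form is the four bytes 'H', 0, 'i', 0, so the std::string holds 4 bytes but c_string_length sees only 1.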

还给你的自由
#4 · 2019-01-01 07:11

Unicode is not supported by the Standard Library (for any reasonable meaning of supported).

std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its content as a blob of bytes.

If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, ...) you are out of luck.
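One way to see the "blob of bytes" behaviour: the same user-visible character "é" can be written as one precomposed code point or as "e" plus a combining accent. Unicode treats the two as canonically equivalent, but std::string only compares bytes. The names below are made up for this sketch.

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (UTF-8 bytes: C3 A9).
const std::string precomposed = "\xC3\xA9";

// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes: 65 CC 81).
const std::string decomposed = "e\xCC\x81";

// std::string::operator== compares the stored bytes and nothing else.
bool bytewise_equal()
{
    return precomposed == decomposed;
}
```

Here bytewise_equal() returns false, and the sizes are 2 and 3 bytes respectively, even though both strings render as a single grapheme. Making them compare equal requires normalization, which is the kind of thing a library such as ICU provides.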

The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.

听够珍惜
#5 · 2019-01-01 07:13

How well does the C++ standard library support Unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library gives us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.
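To make "code units" concrete, here is a sketch that stores the single code point U+1F34C in three basic_string specializations. The variable names are invented for the example. The UTF-8 bytes are spelled out with hex escapes instead of a u8 literal, because u8 literals changed type to char8_t in C++20 and would no longer initialize a std::string there.

```cpp
#include <string>

// One code point, U+1F34C, in three encodings.
const std::string    banana_utf8  = "\xF0\x9F\x8D\x8C"; // 4 code units (bytes)
const std::u16string banana_utf16 = u"\U0001F34C";      // 2 code units (surrogate pair)
const std::u32string banana_utf32 = U"\U0001F34C";      // 1 code unit
```

size() returns 4, 2 and 1 respectively: it always counts code units of the string's element type, never code points or characters. The two UTF-16 units are the surrogates 0xD83C and 0xDF4C.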

It also provides some tools from the C library, declared in <cuchar>, that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
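A minimal sketch of calling one of these functions, std::mbrtoc32. The wrapper name decode_first is hypothetical. Note the important caveat: the narrow input is interpreted according to the current C locale, so non-ASCII input only decodes as UTF-8 if a UTF-8 locale has been selected with std::setlocale. Which locale names exist is platform-specific, so the sketch does not set one.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>   // std::mbstate_t

// Hypothetical wrapper: decodes the first multibyte character of `bytes`
// into `out` using std::mbrtoc32 and the current C locale.
// Returns the number of bytes consumed, or 0 if nothing usable was decoded.
std::size_t decode_first(const char* bytes, std::size_t length, char32_t& out)
{
    std::mbstate_t state = std::mbstate_t();   // fresh conversion state
    const std::size_t result = std::mbrtoc32(&out, bytes, length, &state);

    // mbrtoc32 signals special conditions with the values (size_t)-1
    // (invalid sequence), (size_t)-2 (incomplete sequence) and (size_t)-3.
    if (result == static_cast<std::size_t>(-1) ||
        result == static_cast<std::size_t>(-2) ||
        result == static_cast<std::size_t>(-3)) {
        return 0;
    }
    return result;
}
```

In the default "C" locale this reliably handles ASCII: decoding "A" consumes one byte and yields U'A'. Anything beyond that depends on the locale in effect, which is part of why these functions are an awkward bridge.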

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"\U0001F34C"?
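A small sketch of the problem with these one-argument-per-character interfaces, using std::isspace from <locale> and the classic "C" locale. The function name any_unit_is_space is made up. U+3000 IDEOGRAPHIC SPACE is whitespace according to Unicode, but in UTF-8 it is the three bytes E3 80 80, and the function only ever sees one of them at a time.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: returns true if any single code unit (char) of `utf8`
// is classified as whitespace by the <locale> convenience function.
bool any_unit_is_space(const std::string& utf8)
{
    const std::locale& loc = std::locale::classic();
    for (char unit : utf8) {
        if (std::isspace(unit, loc)) {
            return true;
        }
    }
    return false;
}
```

An ASCII space is detected, but "\xE3\x80\x80" (U+3000) is not: none of its individual bytes is a whitespace character on its own, and the interface offers no way to pass all three together.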

爱死公子算了
#6 · 2019-01-01 07:13

C++11 has new string literal types for Unicode: u8"..." for UTF-8, u"..." for UTF-16 and U"..." for UTF-32.

Unfortunately the support in the standard library for variable-width encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.
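One workaround is to decode into a fixed-width std::u32string, where size() does equal the number of code points. The following is only a sketch: decode_utf8 is a name invented here, and the code assumes well-formed UTF-8 (it does not reject overlong forms, stray continuation bytes or truncated sequences).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32, one element per code point.
// Assumes well-formed input; malformed input is not detected.
std::u32string decode_utf8(const std::string& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < utf8.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Decoding "a\xC3\xA9\xF0\x9F\x8D\x8C" (seven bytes) gives a std::u32string of size 3 holding U+0061, U+00E9 and U+1F34C. That size is still a count of code points, not of user-perceived characters.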
