I've read and heard that C++11 supports Unicode. A few questions on that:

- How well does the C++ standard library support Unicode?
- Does `std::string` do what it should?
- How do I use it?
- Where are potential problems?
Does `std::string` do what it should?
However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for `std::string`/`std::wstring`. It aims to fill the gap left by the still-missing UTF-8 string container class. This might be the most comfortable way of dealing with UTF-8 strings (that is, without Unicode normalization and similar operations): you comfortably operate on code points, while your string stays stored as a variable-length-encoded sequence of `char`s.
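A minimal sketch of what that looks like. The header path and the names `tiny_utf8::string`, `length()` and `size()` are my recollection of the project's README, so double-check them against the version you actually install:

```cpp
// Assumed API: tiny_utf8::string exposes code-point-based length() and
// iteration while storing the text as UTF-8 bytes internally. Header path
// and member names may differ between versions of the library.
#include <tinyutf8/tinyutf8.h>
#include <cstdint>
#include <iostream>

int main() {
    tiny_utf8::string s = "h\xC3\xA9llo";     // "héllo" written as raw UTF-8 bytes

    std::cout << s.length() << '\n';           // 5 code points
    std::cout << s.size()   << '\n';           // 6 bytes in the underlying buffer

    for (char32_t cp : s)                      // iteration yields code points
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << ' ';
    std::cout << '\n';
}
```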
You can safely store UTF-8 in a `std::string` (or in a `char[]` or `char*`, for that matter), because a Unicode NUL (U+0000) is a null byte in UTF-8, and that is the only way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including `std::cout` and `std::cerr`, so long as your locale is UTF-8).

What you cannot do with `std::string` for UTF-8 is get the length in code points. `std::string::size()` will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.

If you need to operate on UTF-8 strings at the code point level (not just store and print them), or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
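If all you need is the code point count of a UTF-8 `std::string`, you can compute it yourself by skipping continuation bytes. Here is a minimal sketch (a hand-rolled helper, not a standard facility, and it assumes the input is already valid UTF-8):

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Counts code points by counting bytes that are NOT UTF-8 continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}

int main() {
    std::string s = "h\xC3\xA9llo";            // "héllo": 5 code points, 6 bytes
    std::cout << s.size() << '\n';              // 6  (bytes)
    std::cout << utf8_code_points(s) << '\n';   // 5  (code points)
}
```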
Unicode is not supported by the standard library (for any reasonable meaning of "supported").

`std::string` is no better than `std::vector<char>`: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its contents as a blob of bytes. If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, ...) you are out of luck.
The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.
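For a sense of what that looks like in practice, here is a minimal sketch using `icu::UnicodeString` (ICU's C++ string class, which stores text as UTF-16 internally). The calls shown are part of ICU's documented C++ API, but treat the details as illustrative and check the documentation of the ICU version you link against (e.g. `-licuuc`):

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <iostream>
#include <string>

int main() {
    // Build a UnicodeString from UTF-8 input ("héllo 🍌" spelled out in bytes).
    icu::UnicodeString ustr =
        icu::UnicodeString::fromUTF8("h\xC3\xA9llo \xF0\x9F\x8D\x8C");

    std::cout << ustr.length()      << '\n';   // 8: UTF-16 code units
    std::cout << ustr.countChar32() << '\n';   // 7: code points

    std::string utf8;
    ustr.toUTF8String(utf8);                   // convert back to UTF-8
    std::cout << utf8 << '\n';
}
```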
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:

- Strings library
- Localization library
- Input/output library
- Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Yes. According to the C++ standard, this is what `std::string` and its siblings should do:

> The class template `basic_string` describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, `std::string` does that just fine. Does that provide any Unicode-specific functionality? No. Should it? Probably not.
`std::string` is fine as a sequence of `char` objects. That's useful; the only annoyance is that it is a very low-level view of text, and standard C++ doesn't provide a higher-level one.

Use it as a sequence of `char` objects; pretending it is something else is bound to end in pain.
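As a concrete example of that pain, here is a minimal sketch (assuming the source file and terminal are both UTF-8) of how a byte-oriented operation like `substr` happily cuts a multi-byte sequence in half:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string s = "na\xC3\xAFve";      // "naïve": 5 code points, 6 bytes ('ï' is 0xC3 0xAF)

    std::cout << s.size() << '\n';        // 6 -- size() counts bytes, not characters

    std::string cut = s.substr(0, 3);     // slices at a byte offset: "na" plus half of 'ï'
    std::cout << cut << '\n';             // prints an invalid UTF-8 sequence (mojibake)
}
```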
All over the place? Let's see...

Strings library
The strings library provides us `basic_string`, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: `c16rtomb`/`mbrtoc16` and `c32rtomb`/`mbrtoc32`.
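A minimal sketch of that C bridge in action, using `std::mbrtoc32` to walk a multibyte string code point by code point. It assumes a UTF-8 locale name that may not exist on every system, and that your standard library actually ships `<cuchar>` (older toolchains did not):

```cpp
#include <cuchar>    // std::mbrtoc32
#include <cwchar>    // std::mbstate_t
#include <clocale>
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    // mbrtoc32 converts from the *current locale's* multibyte encoding,
    // so a UTF-8 locale must be selected; the name below is an assumption.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* text = "h\xC3\xA9llo";    // "héllo" spelled out as UTF-8 bytes
    const char* end  = text + std::strlen(text);

    std::mbstate_t state{};
    char32_t cp;
    const char* p = text;
    while (p < end) {
        std::size_t rc = std::mbrtoc32(&cp, p, static_cast<std::size_t>(end - p), &state);
        if (rc == 0 || rc > static_cast<std::size_t>(end - p))
            break;                         // null terminator, error, or incomplete sequence
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << '\n';
        p += rc;
    }
}
```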
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the `<locale>` header: `isspace`, `isprint`, `iscntrl`, `isupper`, `islower`, `isalpha`, `isdigit`, `ispunct`, `isxdigit`, `isalnum`, `isgraph`, `toupper` and `tolower`, each templated on the character type. How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in `u8"🍌"`?
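To make that concrete, here is a minimal sketch of the problem: each convenience-interface call receives a single `char` (one UTF-8 code unit), so when a string holds the four-byte encoding of U+1F34C it is asked about four meaningless byte values, never about the character itself:

```cpp
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::string banana = "\xF0\x9F\x8D\x8C";  // UTF-8 encoding of U+1F34C
    std::locale loc;                           // the current global locale

    for (char c : banana) {
        // Classifies one code unit at a time; it cannot know it is looking
        // at one quarter of an emoji.
        std::cout << std::isalpha(c, loc) << ' ';
    }
    std::cout << '\n';                         // prints "0 0 0 0"
}
```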
C++11 has a couple of new literal string types for Unicode.

Unfortunately, the support in the standard library for variable-width encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.
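For reference, a minimal sketch of those new literal types and the character types they produce (note that `u8""` yields plain `char` in C++11/14/17 but `char8_t` since C++20):

```cpp
#include <string>

int main() {
    const char*    utf8  = u8"z\u00df\u6c34";  // UTF-8 encoded (char in C++11)
    std::u16string utf16 = u"z\u00df\u6c34";   // char16_t, UTF-16 encoded
    std::u32string utf32 = U"z\u00df\u6c34";   // char32_t, UTF-32 encoded
    std::wstring   wide  = L"z\u00df\u6c34";   // wchar_t, implementation-defined encoding

    // Note: size()/length() on all of these count code units, not code points.
    (void)utf8; (void)utf16; (void)utf32; (void)wide;
}
```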