Working with std::string and UTF-8 seems rather complicated, and I cannot find a good explanation of the dos and don'ts.
How can I properly work with UTF-8 in C++? It is rather confusing.
I've found boost::locale and I set the global locale:
std::locale::global(boost::locale::generator()(""));
However, what do I need to think about after this, and where can I run into problems? Will reading from and writing to files, string comparisons, etc. work as expected?
So far I'm aware of the following:
- std::regex/boost::regex will not work; I need to convert to wide strings and use wregex.
- boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper.
Other than that what do I need to be aware of?
Welcome to the magnificent world of Unicode.
- Sorry, but wchar_t is implementation-defined, and on Windows it is typically 16 bits wide, which is not sufficient to hold a full code point for some Asian scripts (for example).
- You can use plain comparisons for look-up, but to sort data and present it to users you will need a full collation algorithm. Note, for example, that the order in a German dictionary is different from the order in a German phone book (and cry...)
- Generally speaking, I would advise against transforming strings yourself. The Boost.Locale algorithms should work in general, as they wrap ICU, but otherwise refrain from ad-hoc operations.
- If you split a string into several parts, don't split in the middle of words. It is too easy to split a character in two (even with code-point-aware algorithms, because of combining diacritics) or, even if you avoid that, to split between two characters that belong together (some cultures treat certain combinations of adjacent characters as a single unit).
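
To make the splitting pitfall concrete, here is a minimal sketch that uses only the standard library. The helper names (is_continuation, count_code_points, truncate_on_code_point) are made up for this illustration, and the input is assumed to be valid UTF-8. It shows that std::string::size() counts bytes rather than characters, and that even a cut placed on a code-point boundary can still separate a base letter from its combining accent.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the byte is a UTF-8 continuation byte,
// i.e. it matches the bit pattern 10xxxxxx.
bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: counts code points by counting every byte that
// is not a continuation byte. Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if (!is_continuation(byte)) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: keeps at most max_bytes bytes, moving the cut
// backwards until it falls on a code-point boundary.
// It does NOT keep combining sequences or grapheme clusters together.
std::string truncate_on_code_point(const std::string& utf8,
                                   std::size_t max_bytes) {
    if (utf8.size() <= max_bytes) {
        return utf8;
    }
    std::size_t end = max_bytes;
    // utf8[end] is the first byte that would be dropped; if it is a
    // continuation byte, the cut would land inside a code point.
    while (end > 0 && is_continuation(static_cast<unsigned char>(utf8[end]))) {
        --end;
    }
    return utf8.substr(0, end);
}
```

For example, "h\xC3\xA9llo" ("héllo" with a precomposed é) has 6 bytes but 5 code points, and truncating it to 2 bytes yields "h" rather than a broken byte sequence. By contrast, "e\xCC\x81" (e followed by U+0301 COMBINING ACUTE ACCENT) truncated to 1 byte yields a bare "e": the cut is on a valid code-point boundary, yet the accent is lost. Handling that case needs real boundary analysis, such as the boost::locale::boundary facilities, rather than hand-written byte logic.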