Working with UTF8

Published 2019-06-24 10:48

Question:

Working with std::string and UTF-8 seems rather complicated, and I cannot find a good explanation of the dos and don'ts.

How can I properly work with UTF8 in C++? It is rather confusing.

I've found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

After this, what do I need to think about, and where can I run into problems? Will reading from and writing to files, string comparisons, and so on work as expected?

So far I'm aware of the following:

  • std::regex/boost::regex will not work; I need to convert to wide strings and use wregex.
  • boost::algorithm::to_upper will not work; I need to use boost::locale::to_upper.
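
To illustrate the second point, here is a minimal sketch of what a byte-wise case conversion does to UTF-8 data. The helper name ascii_upper_bytes is hypothetical and is not part of Boost or the standard library; it only stands in for any algorithm that treats each char as a character.

```cpp
#include <string>

// Hypothetical helper (not a Boost or standard function): uppercases
// ASCII letters one byte at a time and leaves every other byte alone.
// Assumes the input is UTF-8, so non-ASCII characters are multi-byte
// sequences whose bytes all fall outside the range 'a'..'z'.
inline std::string ascii_upper_bytes(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }
    return s;
}
```

Given "caf\xC3\xA9" (the word "café" encoded as UTF-8, five bytes long), this returns "CAF\xC3\xA9": the two bytes of "é" are left untouched, so the accented letter is never uppercased. A locale-aware routine such as boost::locale::to_upper is needed for that.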

Other than that what do I need to be aware of?

Answer 1:

Welcome to the magnificent world of Unicode.

  1. Sorry, wchar_t is implementation-defined, and on Windows it is typically 16 bits wide, which is not sufficient to hold every code point (for example, characters outside the Basic Multilingual Plane, which include some Asian scripts).
  2. You can use plain comparisons for look-up, but to sort data and present it to an audience you will need a full collation algorithm. Note, for example, that the order in the German dictionary is different from that in the German phone book (and cry...)
  3. Generally speaking, I would advise against transforming the strings yourself. The Boost.Locale algorithms should work in general, as they wrap ICU, but otherwise refrain from ad-hoc operations.
  4. If you split the string into several parts, don't split in the middle of words. It's too easy to either split a character in two (even with code-point-aware algorithms, because of diacritics) or, even when avoiding that, to split between two characters that belong together (because some cultures consider certain combinations of adjacent characters as one).
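
To make point 4 concrete, here is a minimal sketch of code-point-aware counting and splitting on a std::string, assuming the string holds valid UTF-8. The function names are hypothetical and come from no library. The sketch works at code-point level only, so it shows exactly the limitation described above: it will never cut a multi-byte sequence, but it can still separate a base letter from its combining diacritic.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts code points by counting every byte that starts a sequence.
// Assumes valid UTF-8; no validation is performed.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Returns the largest byte offset <= max_bytes that does not fall
// inside a multi-byte sequence. This is a code-point boundary, NOT a
// grapheme (user-perceived character) boundary.
inline std::size_t code_point_boundary(const std::string& s,
                                       std::size_t max_bytes) {
    if (max_bytes >= s.size()) return s.size();
    std::size_t pos = max_bytes;
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301 COMBINING ACUTE ACCENT), size() is 3 and count_code_points gives 2, although a reader sees a single character. code_point_boundary with max_bytes = 2 returns 1, which is a valid code-point boundary but cuts the accent off its base letter. Avoiding that requires grapheme-cluster segmentation, which is what libraries such as ICU (and Boost.Locale on top of it) provide.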