How to read/store unicode with STL strings and str

Posted 2019-02-07 01:17

Question:

I need to modify my program to accept Unicode, which may come from any of UTF-8 and the various UTF-16 and UTF-32 encodings. I don't really know much about Unicode (though I've read Joel Spolsky's article and the Wikipedia page).

Right now I'm using an std::istream and reading my input char by char, and then storing (when necessary) in an std::string. I'd like to

  • modify this (with as little effort as possible) to support the above encodings, and
  • figure out how to test the above encodings (I'm kinda white-bread American, and don't really know how to even make a sample text file in another encoding), and ideally
  • do this in a cross-platform way.

Also, I'd like to conserve as much space as possible (so if we don't need more than a byte per character, we don't use it). From what I understand, this means storing in UTF-8, which is fine, but I don't know of a standard string that does this (from what I understand, wchar_t has implementation-defined size and encoding).

Answer 1:

UTF-8 conserves space, as long as you are primarily using the standard ASCII characters.

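std::string has no problem with UTF-8, because UTF-8 text contains no zero bytes (other than an encoded U+0000 itself). std::string can also hold data that does contain zero bytes, as UTF-16 and UTF-32 text does, as long as you pass the length explicitly when constructing it. Note that std::string cannot tell you how many characters a UTF-8 string contains: size() and length() count bytes, so you would have to use a separate function to count code points.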
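To make the byte-versus-character distinction concrete, here is a minimal sketch. The helper names count_code_points and from_raw_bytes are made up for illustration, and the counter assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; it does no validation.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}

// Hypothetical helper: builds a std::string from raw bytes that may contain
// zero bytes by passing the length explicitly instead of relying on a
// terminating '\0'.
std::string from_raw_bytes(const char* data, std::size_t length) {
    return std::string(data, length);
}
```

For the text "h", "é", "€" encoded as the bytes 68 C3 A9 E2 82 AC, size() reports 6 while count_code_points reports 3. Likewise, std::string("A\0\0\0") stops at the first zero byte and has size 1, whereas from_raw_bytes("A\0\0\0", 4) keeps all four bytes of the UTF-32 little-endian "A".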

There is also a wide version of std::string that uses wchar_t instead of char: std::wstring.
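Since the question is concerned with space, it may help to see what the wide string, std::wstring, costs. The sketch below uses a hypothetical helper, wide_storage_bytes; the size of wchar_t is implementation-defined (commonly 2 bytes on Windows and 4 bytes on Linux and macOS), so the result differs between platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the characters of a wide string.
// sizeof(wchar_t) is implementation-defined, so this is not portable
// as a fixed number.
std::size_t wide_storage_bytes(const std::wstring& text) {
    return text.size() * sizeof(wchar_t);
}

// The same ASCII text stored as narrow chars always takes one byte each.
std::size_t narrow_storage_bytes(const std::string& text) {
    return text.size() * sizeof(char);
}
```

For plain ASCII input, the wide version therefore uses two or four times the storage of a UTF-8 std::string, depending on the platform.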

There are also facets in Boost for converting between encodings.

You can use the standard library together with Boost, or you can use the string-handling functions from the C library. Programming frameworks such as Qt and Tcl also provide their own functions.

See for example:

utf8 codecvt facet
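To illustrate what such a conversion does, here is a dependency-free sketch that decodes UTF-8 into UTF-32 code points. This is not the Boost facet and not a standard library API: utf8_to_utf32 is a hypothetical hand-written helper. It rejects bad lead bytes, bad continuation bytes, and truncated sequences, but it does not check for overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t code_point = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { code_point = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { code_point = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { code_point = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { code_point = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            code_point = (code_point << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(code_point);
        i += extra + 1;
    }
    return out;
}
```

With this, the six bytes 68 C3 A9 E2 82 AC decode to the three code points U+0068, U+00E9, and U+20AC. For production use, a maintained library is likely a better choice than a hand-rolled decoder.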



Answer 2:

Have a look at the question "Switching from std::string to std::wstring for embedded applications?".

As Pukku said there, you might get some headaches because the C++ standard requires wide streams to convert wide characters to narrow (byte) characters when writing to a file, and how this conversion is done depends on the implementation and the stream's locale.
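One way to sidestep that conversion, assuming the text is kept as UTF-8 in a std::string, is to move the bytes through narrow streams without any character conversion. The sketch below uses hypothetical helpers (write_bytes, read_bytes, round_trip) and an in-memory std::stringstream. For real files you would presumably open the std::ifstream or std::ofstream with std::ios::binary so that newline translation does not alter the bytes.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the bytes of a string unchanged, using an
// explicit length so embedded zero bytes are preserved.
void write_bytes(std::ostream& out, const std::string& bytes) {
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Hypothetical helper: reads every remaining byte from a narrow stream.
std::string read_bytes(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>{in},
                       std::istreambuf_iterator<char>{});
}

// Writes to an in-memory stream and reads the bytes back, for illustration.
std::string round_trip(const std::string& bytes) {
    std::stringstream buffer;
    write_bytes(buffer, bytes);
    return read_bytes(buffer);
}
```

Because narrow streams work on char, the UTF-8 bytes come back exactly as written, and decoding them is left to the program.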