Is it possible to convert a UTF-8 string held in a std::string to a std::wstring, and vice versa, in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to the standard C++ library.
You can extract utf8_codecvt_facet from the Boost serialization library. Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.
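Their usage example follows this general shape (a sketch only: the exact header path and the namespace of the facet depend on the Boost version and on how you extract it; it is assumed here that the facet is visible as a class utf8_codecvt_facet deriving from std::codecvt<wchar_t, char, std::mbstate_t>):

```cpp
#include <fstream>
#include <locale>
#include <string>
#include "utf8_codecvt_facet.hpp" // extracted from the Boost sources

int main() {
    // Build a locale whose codecvt facet converts between UTF-8 and
    // wchar_t; the locale takes ownership of the facet pointer.
    std::locale old_locale;
    std::locale utf8_locale(old_locale, new utf8_codecvt_facet);

    // Any wide stream imbued with this locale decodes UTF-8 on the fly.
    std::wifstream input("data.utf8");
    input.imbue(utf8_locale);

    std::wstring line;
    std::getline(input, line); // line now holds wide characters
}
```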
UTFConverter - check out this library. It does such a conversion, but you also need the ConvertUTF class - I've found it here.
I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.
As Chris suggested, your best bet is to play with codecvt.
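For reference, in C++11 and later one way to "play with codecvt" without writing a facet by hand is std::wstring_convert with the ready-made std::codecvt_utf8 facet. A minimal sketch, assuming a C++11 standard library (note that on platforms with a 16-bit wchar_t this converts to UCS-2, not full UTF-16):

```cpp
#include <codecvt> // deprecated in C++17, but still shipped by major toolchains
#include <locale>
#include <string>

int main() {
    // codecvt_utf8<wchar_t> converts between UTF-8 and UCS-2/UCS-4,
    // depending on the platform's wchar_t width.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    std::string utf8_in = "\xC3\xA9t\xC3\xA9";      // "été" encoded as UTF-8
    std::wstring wide   = conv.from_bytes(utf8_in); // UTF-8 -> wstring
    std::string  round  = conv.to_bytes(wide);      // wstring -> UTF-8
}
```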
I asked this question 5 years ago. This thread was very helpful for me back then; I came to a conclusion and moved on with my project. It's funny that I recently needed something similar, totally unrelated to that old project. As I was researching possible solutions, I stumbled upon my own question :)
The solution I chose now is based on C++11. The Boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions look like this:
UTF-8 to UTF-16
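A minimal sketch with std::wstring_convert (standard in C++11; the wrapper function name is illustrative):

```cpp
#include <codecvt>
#include <locale>
#include <string>

std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8); // throws std::range_error on bad input
}
```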
UTF-16 to UTF-8
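And the reverse direction, with the same converter type:

```cpp
#include <codecvt>
#include <locale>
#include <string>

std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

Note that std::wstring_convert and the <codecvt> header were deprecated in C++17, though they still ship with the major implementations.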
As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.
The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.
Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.
Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.
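A sketch of the wchar_t-to-UTF-8 direction, along the lines the answer describes (the function name and signature are illustrative):

```cpp
#include <string>

// Convert a wide string (UTF-16 or UTF-32, depending on the platform's
// wchar_t) to UTF-8. The input is assumed to be well formed.
std::string wchar_to_utf8(const std::wstring& in) {
    std::string out;
    unsigned int codepoint = 0;
    for (wchar_t wc : in) {
        unsigned int c = static_cast<unsigned int>(wc);
        if (c >= 0xd800 && c <= 0xdbff) {
            // High surrogate: remember it and wait for the low half.
            codepoint = ((c - 0xd800) << 10) + 0x10000;
            continue;
        }
        if (c >= 0xdc00 && c <= 0xdfff)
            codepoint += c - 0xdc00;  // combine with the pending high surrogate
        else
            codepoint = c;            // BMP character or full 32-bit code point

        // Emit the code point as 1-4 UTF-8 bytes.
        if (codepoint <= 0x7f) {
            out += static_cast<char>(codepoint);
        } else if (codepoint <= 0x7ff) {
            out += static_cast<char>(0xc0 | (codepoint >> 6));
            out += static_cast<char>(0x80 | (codepoint & 0x3f));
        } else if (codepoint <= 0xffff) {
            out += static_cast<char>(0xe0 | (codepoint >> 12));
            out += static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (codepoint & 0x3f));
        } else {
            out += static_cast<char>(0xf0 | (codepoint >> 18));
            out += static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (codepoint & 0x3f));
        }
        codepoint = 0;
    }
    return out;
}
```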
The above code works for both UTF-16 and UTF-32 input, simply because the range d800 through dfff consists of invalid code points; their presence indicates that you're decoding UTF-16. If you know that wchar_t is 32 bits, then you could remove some code to optimize the function.
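And a sketch of the reverse direction, UTF-8 to wchar_t (again with illustrative names):

```cpp
#include <string>

// Convert UTF-8 to a wide string. Produces UTF-16 (with surrogate pairs)
// when wchar_t is 16 bits, UTF-32 otherwise. The input is assumed to be
// well formed.
std::wstring utf8_to_wchar(const std::string& in) {
    std::wstring out;
    unsigned int codepoint = 0;
    int remaining = 0;  // continuation bytes still expected
    for (unsigned char ch : in) {
        if (ch <= 0x7f) {            // 1-byte sequence (ASCII)
            codepoint = ch;
            remaining = 0;
        } else if (ch <= 0xbf) {     // continuation byte: 10xxxxxx
            codepoint = (codepoint << 6) | (ch & 0x3f);
            --remaining;
        } else if (ch <= 0xdf) {     // 2-byte lead: 110xxxxx
            codepoint = ch & 0x1f;
            remaining = 1;
        } else if (ch <= 0xef) {     // 3-byte lead: 1110xxxx
            codepoint = ch & 0x0f;
            remaining = 2;
        } else {                     // 4-byte lead: 11110xxx
            codepoint = ch & 0x07;
            remaining = 3;
        }
        if (remaining > 0)
            continue;  // the multi-byte sequence is not complete yet

        if (sizeof(wchar_t) > 2) {
            // 32-bit wchar_t: every code point fits directly.
            out += static_cast<wchar_t>(codepoint);
        } else if (codepoint > 0xffff) {
            // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
            codepoint -= 0x10000;
            out += static_cast<wchar_t>(0xd800 + (codepoint >> 10));
            out += static_cast<wchar_t>(0xdc00 + (codepoint & 0x3ff));
        } else {
            out += static_cast<wchar_t>(codepoint);
        }
        codepoint = 0;
    }
    return out;
}
```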
Again, if you know that wchar_t is 32 bits, you could remove some code from this function, but in this case it shouldn't make any difference: the expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.

There are several ways to do this, but the results depend on what character encodings are in the string and wstring variables.
If you know the string is ASCII, you can simply use wstring's iterator constructor:
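For example (a sketch; correct only when every character is single-byte ASCII):

```cpp
#include <string>

int main() {
    std::string s = "This is surely ASCII";
    // Each char is widened individually - fine for ASCII, wrong otherwise.
    std::wstring ws(s.begin(), s.end());
}
```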
If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.

If your string contains characters in a code page, then may $DEITY have mercy on your soul.