Using std::wstring the way I am with MultiByteToWideChar?
std::wstring widen(const std::string &in)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, &in[0], -1, NULL, 0);
    std::wstring out(len, 0);
    MultiByteToWideChar(CP_UTF8, 0, &in[0], -1, &out[0], len);
    return out;
}
If you're asking will it work, probably. Is it correct?
- You should use in.c_str() instead of &in[0].
- You should check the return value of MultiByteToWideChar, at least the first time.
MultiByteToWideChar invoked with a (-1) length will, if successful, count the zero terminator in its result (i.e. it always returns >= 1 on success). The length constructor for std::wstring does not need this: std::wstring(5, 0) already allocates space for six wide chars, five plus the zero terminator. So technically you're allocating one wide char too many.
From the MultiByteToWideChar docs on cbMultiByte and -1:
If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.
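To make the off-by-one concrete, here is a small portable sketch (no Windows headers needed). The helper names make_buffer_like_question and make_buffer_trimmed are made up for illustration, and reported_len stands in for the value a sizing call with a -1 length would return, i.e. the character count plus one for the terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: sizes the string the way the question's code does,
// using the reported length as-is (payload characters + 1 for the terminator).
std::wstring make_buffer_like_question(int reported_len)
{
    assert(reported_len >= 1);  // a successful -1-length call always counts the terminator
    return std::wstring(static_cast<std::size_t>(reported_len), L'\0');
}

// Hypothetical helper: drops the terminator slot, because std::wstring
// manages its own trailing zero in addition to size() elements.
std::wstring make_buffer_trimmed(int reported_len)
{
    assert(reported_len >= 1);
    return std::wstring(static_cast<std::size_t>(reported_len - 1), L'\0');
}
```

With a reported length of 6, the first helper yields a string whose size() is 6, while the second yields a size() of 5 and still has a readable terminator at c_str()[5].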
There is a problem with your first call to MultiByteToWideChar: the character sequence that &in[0] points to is not guaranteed to be zero terminated (although in practice it usually is). Change that line to
int len = MultiByteToWideChar(CP_UTF8, 0, in.c_str(), -1, NULL, 0);
and you should be safe. Even if MultiByteToWideChar fails and returns 0, this is accounted for by passing len as the final parameter in the second call to MultiByteToWideChar.
With that said, it is safe in the sense that it doesn't crash or corrupt memory. There is, however, one more issue: unless the input string causes MultiByteToWideChar to fail, the returned string will claim that its size() is one character larger than it should be. I would recommend changing the code as follows:
#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring widen(std::string const &in)
{
    std::wstring out{};
    if (in.length() > 0)
    {
        // Calculate target buffer size (not including the zero terminator).
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      in.c_str(), in.size(), NULL, 0);
        if ( len == 0 )
        {
            throw std::runtime_error("Invalid character sequence.");
        }
        out.resize(len);
        // No error checking. We already know that the conversion will succeed.
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            in.c_str(), in.size(), &out[0], out.size());
        // Use out.data() in place of &out[0] for C++17
    }
    return out;
}
This implementation addresses the following issues:
- It reports errors in case the input sequence is not valid UTF-8, by passing the MB_ERR_INVALID_CHARS flag.
- Errors are reported by throwing exceptions. That makes it possible to distinguish between conversion errors and a successful call that returns a zero-sized string. (Note: The std::wstring c'tor already throws exceptions in case of failure. It would feel unnatural to not throw exceptions for other errors.)
- The implementation properly deals with input containing embedded NUL characters. This is rarely used, but when it is (e.g. when composing the OPENFILENAME's lpstrFilter member), it won't (silently) fail for that reason.
- It doesn't over-allocate the return value's container storage. In case the cbMultiByte argument is set to -1 in a call to MultiByteToWideChar, the returned length does include space for the zero terminator. This character, however, is owned by the std::string implementation, and not part of the character sequence to be converted.
- Related to the previous bullet point, this implementation doesn't convert the zero terminator. The original code did, and the returned string produces 2 NUL characters at the end of the string when the c_str() member is invoked.
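The embedded-NUL and trailing-terminator points can be illustrated without calling the Windows API at all. The sketch below uses hypothetical helper names (payload_seen_by_nul_scan, payload_seen_by_size, with_converted_terminator); it only models what the two calling styles observe and does not perform any conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Payload characters reached when the input is treated as NUL-terminated
// (the -1 style): the scan stops at the first NUL, embedded or not.
std::size_t payload_seen_by_nul_scan(const std::string& in)
{
    return std::strlen(in.c_str());
}

// Payload characters reached when the explicit size is passed instead:
// the whole sequence, embedded NULs included.
std::size_t payload_seen_by_size(const std::string& in)
{
    return in.size();
}

// Models the original code's result: the converted text plus a converted
// terminator stored as a regular element of the string.
std::wstring with_converted_terminator(const std::wstring& payload)
{
    std::wstring out = payload;
    out.push_back(L'\0');  // the extra element a -1-length conversion writes
    assert(out.size() == payload.size() + 1);
    return out;
}
```

For a filter-style input such as std::string("Text\0*.txt\0", 11), the NUL scan sees 4 characters while the explicit size sees all 11. A string built by with_converted_terminator(L"abc") has a size() of 4 and no longer compares equal to L"abc", which is the kind of surprise the recommended implementation avoids.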
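No, at least not before C++11: until then a std::wstring was not guaranteed to store its data in a contiguous block of memory (though it most likely does in your implementation). C++11 and later do guarantee contiguous storage. If you have to support an older standard, use a std::vector<wchar_t> instead.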
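The std::vector<wchar_t> route could look roughly like the sketch below. It is portable and deliberately avoids the Windows API: fill_wide is a hypothetical stand-in for any C-style function that writes into a caller-supplied wchar_t buffer, and widen_via_vector is likewise an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a C-style API that writes `count` wide
// characters into a caller-supplied buffer.
void fill_wide(wchar_t* dest, std::size_t count, wchar_t value)
{
    assert(dest != NULL);
    std::fill(dest, dest + count, value);
}

std::wstring widen_via_vector(std::size_t count, wchar_t value)
{
    if (count == 0)
    {
        return std::wstring();        // &buf[0] on an empty vector would be invalid
    }
    std::vector<wchar_t> buf(count);  // std::vector storage is contiguous
    fill_wide(&buf[0], buf.size(), value);
    // Copy the filled buffer into the string that is actually returned.
    return std::wstring(buf.begin(), buf.end());
}
```

The price of this approach is one extra copy from the vector into the string, which is why writing directly into the string is attractive wherever contiguity is guaranteed.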
The other answers are good but I want to add some extra information for future visitors based on my own research into the same issue.
Microsoft developer, Larry Osterman, has a good blog post describing such a function with a very good point about the return code checking and NRVO (Named Return Value Optimization). You should read the post for discussion if it's still available. I'm including his final code just in case the post goes missing.
std::wstring UnicodeStringFromAnsiString(_In_ const std::string &ansiString)
{
    std::wstring returnValue;
    auto wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, nullptr, 0);
    if (wideCharSize == 0)
    {
        return returnValue;
    }
    returnValue.resize(wideCharSize);
    wideCharSize = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, ansiString.c_str(), -1, &returnValue[0], wideCharSize);
    if (wideCharSize == 0)
    {
        returnValue.resize(0);
        return returnValue;
    }
    returnValue.resize(wideCharSize-1);
    return returnValue;
}
In my own usage, I was able to add the optimization mentioned in the blog comments and not need -1 for the ANSI string length.
C++17 (Section 21.3.1.7.1) documents a newly-added non-const data() method which should be used instead of &in[0] to get a mutable pointer.
charT* data() noexcept;
The STL owns the trailing \0 in the c_str() results, so be careful how you manipulate the string size.
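As a minimal, portable illustration of writing through that mutable pointer, the sketch below copies wide characters into a pre-sized string. The function name copy_through_pointer is made up for this example, and the preprocessor check falls back to &out[0] when the compiler does not report C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: copies `count` wide characters from `src` into a new
// std::wstring by writing through a mutable pointer to its storage.
std::wstring copy_through_pointer(const wchar_t* src, std::size_t count)
{
    assert(src != nullptr);
    std::wstring out(count, L'\0');
#if __cplusplus >= 201703L
    wchar_t* dest = out.data();  // non-const data(), available since C++17
#else
    wchar_t* dest = &out[0];     // pre-C++17 spelling used in the answers above
#endif
    // Write exactly size() elements; the terminator slot stays untouched.
    std::wmemcpy(dest, src, count);
    return out;
}
```

Only the first size() elements are written here. The element at index size() belongs to the string itself and should be left alone.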