libxml2 seems to store all its strings in UTF-8, as xmlChar *:
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there are no routines provided to get a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}
Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string. I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought charLength was clearer since I have both charLength and wcharLength. As for the correctness of the code, wideBuffer will always be larger than or equal to the size required to hold the converted string (I believe), as characters that require more space than a wchar_t will be truncated (I think).
xmlStrlen() returns the number of UTF-8 encoded code units (bytes) in the xmlChar* string. That is not necessarily the same as the number of wchar_t encoded code units needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbstowcs() once to get the correct length, then allocate the memory, and call std::mbstowcs() again to fill the memory. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    std::wstring wideString;
    int charLength = xmlStrlen(xmlString);
    if (charLength > 0)
    {
        std::string origLocale = setlocale(LC_CTYPE, NULL); //copy it, the pointer may be invalidated by the next call
        setlocale(LC_CTYPE, "en_US.UTF-8");
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0); //returns needed length, excluding null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }
        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }
    return wideString;
}
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}
An alternative option is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.
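For instance, with ICU the conversion might look roughly like this (a minimal sketch; the helper name is hypothetical, and it produces ICU's own UTF-16 icu::UnicodeString rather than a std::wstring):
#include <cstdlib>
#include <unicode/unistr.h>      //icu::UnicodeString
#include <unicode/stringpiece.h> //icu::StringPiece
#include <libxml/xmlstring.h>    //xmlChar

//Hypothetical helper: wraps the UTF-8 xmlChar* in ICU's string type (UTF-16 internally).
icu::UnicodeString xmlCharToUnicodeString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //mirror the question's null handling
    return icu::UnicodeString::fromUTF8(
        icu::StringPiece(reinterpret_cast<const char*>(xmlString)));
}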
There are some problems in this code, besides the fact that you are using wchar_t and std::wstring, which is a bad idea unless you're making calls to the Windows API.
xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.
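For instance (a tiny standalone illustration; it assumes the string literal in the source file is UTF-8 encoded):
#include <cstdio>
#include <libxml/xmlstring.h>

int main()
{
    const xmlChar *s = (const xmlChar *)"\xC3\xA9"; //"é": one character encoded as two UTF-8 bytes
    std::printf("%d\n", xmlStrlen(s)); //prints 2 (bytes), not 1 (characters)
    return 0;
}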
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.
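As an aside, a minimal sketch of how that leak can be avoided by letting a container own the temporary buffer (the function name is hypothetical; the rest mirrors the question's code and inherits its other problems):
#include <cstdlib>
#include <string>
#include <vector>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideStringNoLeak(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    int charLength = xmlStrlen(xmlString); //byte count; enough wchar_t slots for the converted string
    if (charLength == 0) { return std::wstring(); }
    std::vector<wchar_t> wideBuffer(charLength); //freed automatically, even if an exception is thrown
    size_t wcharLength = std::mbstowcs(wideBuffer.data(), (const char *)xmlString, charLength);
    if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    return std::wstring(wideBuffer.data(), wcharLength);
}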
My recommendations:
Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or a fixed-length encoding (Linux, OS X).
If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:
std::wstring utf8_to_wstring(const char *utf8)
{
    size_t utf8len = std::strlen(utf8);
    int wclen = MultiByteToWideChar(
        CP_UTF8, 0, utf8, (int)utf8len, NULL, 0);
    wchar_t *wc = NULL;
    try {
        wc = new wchar_t[wclen];
        MultiByteToWideChar(
            CP_UTF8, 0, utf8, (int)utf8len, wc, wclen);
        std::wstring wstr(wc, wclen);
        delete[] wc;
        wc = NULL;
        return wstr;
    } catch (std::exception &) {
        if (wc)
            delete[] wc;
        throw; //rethrow so the function doesn't fall off the end without returning
    }
}
If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open, and man 3 iconv_close for the manuals). You can specify "WCHAR_T" as one of the encodings for iconv().
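A rough sketch of that approach (assuming glibc's iconv; the helper name is hypothetical and the error handling is minimal):
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

std::wstring utf8_to_wstring_iconv(const char *utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8"); //convert UTF-8 to the platform's wchar_t encoding
    if (cd == (iconv_t)-1) { throw std::runtime_error("iconv_open failed"); }

    char *inPtr = const_cast<char*>(utf8); //glibc's iconv() takes char**, not const char**
    size_t inLeft = std::strlen(utf8);

    std::wstring result;
    wchar_t buffer[256];
    while (inLeft > 0) {
        char *outPtr = reinterpret_cast<char*>(buffer);
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        if (rc == (size_t)-1 && errno != E2BIG) { //E2BIG only means the output buffer is full
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
        result.append(buffer, (sizeof(buffer) - outLeft) / sizeof(wchar_t));
    }
    iconv_close(cd);
    return result;
}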
Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.