I need to convert between wstring and string. I figured out that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.
My understanding is that when I read a UTF-8 encoded file into chars, a single non-ASCII character occupies two or more chars (which is how UTF-8 works). I'd like to create such a UTF-8 string from a wstring representation, for a library I use in my code.
Does anybody know how to do it?
I already tried this:
locale mylocale("cs_CZ.utf-8");
mbstate_t mystate = mbstate_t(); // zero-initialize the conversion state before use
wstring mywstring = L"čřžýáí";
const codecvt<wchar_t,char,mbstate_t>& myfacet =
use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);
codecvt<wchar_t,char,mbstate_t>::result myresult;
size_t length = mywstring.length();
char* pstr= new char [length+1];
const wchar_t* pwc;
char* pc;
// translate characters:
myresult = myfacet.out (mystate,
mywstring.c_str(), mywstring.c_str()+length+1, pwc,
pstr, pstr+length+1, pc);
if ( myresult == codecvt<wchar_t,char,mbstate_t>::ok )
cout << "Translation successful: " << pstr << endl;
else cout << "failed" << endl;
return 0;
This prints 'failed' for the cs_CZ.utf-8 locale and works correctly for the cs_CZ.iso8859-2 locale.
C++ itself has little built-in notion of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.
The code below might help you :)
#include <codecvt>
#include <locale>   // std::wstring_convert is declared here
#include <string>

// Note: std::wstring_convert and <codecvt> require C++11 and are deprecated since C++17.

// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
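One thing to watch with this approach: when no error string is passed to the converter's constructor, from_bytes throws std::range_error if the conversion fails. Below is a minimal sketch of a non-throwing wrapper. The names try_utf8_to_wstring and demo_try_utf8_to_wstring are hypothetical and not part of the answer above.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and fills `out` on success,
// returns false if the converter rejects the input bytes.
bool try_utf8_to_wstring(const std::string& bytes, std::wstring& out)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// Hypothetical usage example.
void demo_try_utf8_to_wstring()
{
    std::wstring w;

    // "č" (U+010D) is the two bytes C4 8D in UTF-8.
    bool ok = try_utf8_to_wstring("\xC4\x8D", w);
    assert(ok);
    assert(w == L"\u010D");

    // 0xFF can never appear in well-formed UTF-8.
    bool bad = try_utf8_to_wstring("\xFF", w);
    assert(!bad);
}
```

The wrapper only reports errors that the converter itself signals as exceptions, so it is a convenience rather than a full UTF-8 validator.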
What's your platform? Note that Windows has traditionally not supported UTF-8 locales, which may explain why this fails for you.
To get this done in a platform-dependent way, you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't comment on that option.
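For the Linux side, here is a rough sketch of what the iconv route could look like. It assumes the iconv implementation accepts the encoding name "WCHAR_T" for the platform's wchar_t representation (glibc and GNU libiconv do). The function names wstring_to_utf8_iconv and demo_wstring_to_utf8_iconv are hypothetical.

```cpp
#include <cassert>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a wstring to UTF-8 using iconv.
// Assumes the "WCHAR_T" encoding name is available.
std::string wstring_to_utf8_iconv(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // A code point needs at most 4 bytes in UTF-8.
    std::vector<char> out(in.size() * 4 + 1);

    // iconv takes non-const char pointers but does not modify the input.
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = out.data();
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    // Everything written so far is the converted text.
    return std::string(out.data(), out.size() - dst_left);
}

// Hypothetical usage example.
void demo_wstring_to_utf8_iconv()
{
    // "čř" is U+010D U+0159, i.e. C4 8D C5 99 in UTF-8.
    std::string bytes = wstring_to_utf8_iconv(L"\u010D\u0159");
    assert(bytes == "\xC4\x8D\xC5\x99");
}
```

On some systems iconv lives in a separate library and the program has to be linked with -liconv.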
What a locale does is give the program information about the external encoding, while assuming that the internal encoding doesn't change. If you want to output UTF-8, you need to do it from wchar_t, not from char*.
What you could do is output it as raw data (not as a string); it should then be interpreted correctly if the system's locale is UTF-8.
Also, when using (w)cout/(w)cerr/(w)cin, you need to imbue the locale on the stream.
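To illustrate imbuing a stream, here is a minimal sketch that writes wchar_t text to a file as UTF-8. Note that it swaps in a different technique from the one described above: instead of a named system locale, it builds a locale from the classic one plus a std::codecvt_utf8 facet (C++11, deprecated since C++17), so it does not depend on which locales are installed. The names write_utf8_file, read_bytes, demo_write_utf8_file and the file name are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: write wide text to `path` encoded as UTF-8.
void write_utf8_file(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening; the locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::out | std::ios::binary);
    out << text;
}

// Hypothetical helper: read a file back as raw bytes.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical usage example.
void demo_write_utf8_file()
{
    write_utf8_file("imbue_demo_utf8.txt", L"\u010D\u0159");
    // "čř" should arrive on disk as the bytes C4 8D C5 99.
    assert(read_bytes("imbue_demo_utf8.txt") == "\xC4\x8D\xC5\x99");
}
```

The same imbue call applies to std::wcout, but it should happen before any output is written to the stream.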
The Lexertl library has an iterator that lets you do this:
std::string str;
str.assign(
lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));
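For readers who want to see what such a conversion does under the hood, here is a hand-written sketch of encoding code points as UTF-8. It is not Lexertl's implementation. It assumes each wchar_t holds one complete code point (as on Linux, where wchar_t is 32 bits), so it does not combine UTF-16 surrogate pairs. The names append_utf8, encode_utf8 and demo_encode_utf8 are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole wstring, one wchar_t per code point.
std::string encode_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::const_iterator it = in.begin(); it != in.end(); ++it)
        append_utf8(out, static_cast<unsigned long>(*it));
    return out;
}

// Hypothetical usage example.
void demo_encode_utf8()
{
    assert(encode_utf8(L"a") == "a");
    assert(encode_utf8(L"\u010D") == "\xC4\x8D");      // č
    assert(encode_utf8(L"\u20AC") == "\xE2\x82\xAC");  // €
}
```

The sketch performs no validation, so values outside the Unicode range or lone surrogates are encoded as-is.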