Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.
I've called file.imbue(loc) (and at this point, loc is correct, Russian_Russia.1251).
And buf is of type basic_string<wchar_t>
The reason I'm using basic_ifstream<wchar_t> is because this is a template (so technically, basic_ifstream<T>, but in this case, T=wchar_t).
This all works perfectly with english characters...
while (file >> ch)
{
if(isalnum(ch, loc))
{
buf += ch;
}
else if(!buf.empty())
{
// Do stuff with buf.
buf.clear();
}
}
I don't see why I'm getting garbage when reading Russian characters. (for example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc...
Code page 1251 isn't for Unicode -- if memory serves, it's for 8859-5. Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since doing so would just involve passing the data through un-changed, but most still don't support it. For what it's worth, at least if I recall correctly, C++ 0x is supposed to add this.
There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t templated streams will default to the system code page, even though they are otherwise Unicode enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.
Iostreams, by default, assumes any data on disk is in a non-unicode format, for compatibility with existing programs that do not handle unicode. C++0x will fix this by allowing native unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t>
used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.
If you want to use unicode with iostreams, you need to specify a codecvt facet with the form std::codecvt<wchar_t, wchar_t, mbstate_t>
, which just passes through data unchanged.
I am not sure, but you can try to call setlocale(LC_CTYPE, "");