Reading in Russian characters (Unicode) using a ba

Posted 2019-07-22 04:29

Question:

Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.

I've called file.imbue(loc); at that point loc is correct (Russian_Russia.1251), and buf is of type basic_string<wchar_t>.

I'm using basic_ifstream<wchar_t> because this is a template (so technically basic_ifstream<T>, but in this case T = wchar_t).

This all works perfectly with English characters...

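// file is a basic_ifstream<wchar_t> already imbued with loc;
// ch is a wchar_t and buf is a basic_string<wchar_t>.
while (file >> ch)
{
    if(isalnum(ch, loc))
    {
        // Letter or digit in this locale: extend the current word.
        buf += ch;
    }
    else if(!buf.empty())
    {
        // Anything else ends the current word.
        // Do stuff with buf.
        buf.clear();
    }
}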

I don't see why I'm getting garbage when reading Russian characters. For example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc.

Answer 1:

Code page 1251 isn't a Unicode encoding: it is Windows-1251, a single-byte Cyrillic code page (similar in purpose to ISO 8859-5, but not the same encoding). Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since where wchar_t is 16 bits wide doing so would just involve passing the data through unchanged, but most still don't support it. For what it's worth, at least if I recall correctly, C++0x is supposed to add this.
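The garbage in the question is consistent with a UTF-16 file: in Windows-1251 the bytes 0xFF 0xFE display as "яю", and those two bytes are the UTF-16 little-endian byte-order mark. A minimal sketch of checking for that, assuming the first few bytes of the file have already been read in binary mode into a std::string (the names Bom and detect_bom are hypothetical, not from the original post):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding suggested by a byte-order mark
// at the start of a file's raw bytes.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

inline Bom detect_bom(const std::string& head)
{
    // Read bytes as unsigned so values above 0x7F compare correctly.
    auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    if (head.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (head.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;   // displays as "яю" when decoded as Windows-1251
    if (head.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;          // no mark: the encoding has to be known some other way
}
```

This sketch does not distinguish UTF-32 (whose little-endian mark also starts with 0xFF 0xFE), and a file without a mark can still be in any of these encodings.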



Answer 2:

There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t-templated streams default to the system code page, even though they are otherwise Unicode-enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk.
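To illustrate what a single-byte code-page conversion does to a file that is really Unicode, here is a rough sketch that widens Windows-1251 bytes by hand, one byte at a time. It is not how any particular library implements its facet; the function names are hypothetical, and only ASCII plus the Cyrillic letters are mapped.

```cpp
#include <string>

// Hypothetical helper: maps one Windows-1251 byte to a Unicode code point.
// Sketch only: bytes outside ASCII and the Cyrillic letters become U+FFFD.
inline char32_t cp1251_to_unicode(unsigned char b)
{
    if (b < 0x80) return b;                                  // ASCII and control characters
    if (b >= 0xC0) return static_cast<char32_t>(0x0410 + (b - 0xC0)); // А..я occupy 0xC0..0xFF
    if (b == 0xA8) return 0x0401;                            // Ё
    if (b == 0xB8) return 0x0451;                            // ё
    return 0xFFFD;                                           // not handled in this sketch
}

// Widens a whole byte string, treating every byte as one character.
inline std::u32string widen_cp1251(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out += cp1251_to_unicode(b);
    return out;
}
```

Fed the bytes FF FE 45 04 (the UTF-16LE byte-order mark followed by х), this produces я, ю, E and the control character U+0004, which matches the "яюE" and "square" output described in the question.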



Answer 3:

Iostreams, by default, assume any data on disk is in a non-Unicode format, for compatibility with existing programs that do not handle Unicode. C++0x will fix this by allowing native Unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t> used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.

If you want to use Unicode with iostreams, you need to supply a codecvt facet that passes the data through unchanged. The form std::codecvt<wchar_t, wchar_t, mbstate_t> expresses the idea, but the standard library does not provide that specialization, and basic_filebuf<wchar_t> looks up std::codecvt<wchar_t, char, mbstate_t>, so in practice the pass-through has to be a custom facet derived from that class.
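As an alternative to installing a facet, the same "pass the data through" idea can be sketched by reading the file's raw bytes in binary mode and assembling the UTF-16 code units by hand. This is a swapped-in technique, not what the answer proposes; it assumes the file is UTF-16 little-endian, and decode_utf16le is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes raw UTF-16LE bytes into Unicode code points.
// A leading byte-order mark is skipped; unpaired surrogates are copied as-is.
inline std::u32string decode_utf16le(const std::string& bytes)
{
    // Combine two bytes (low byte first) into one 16-bit code unit.
    auto unit = [&](std::size_t pos) -> char32_t {
        return static_cast<char32_t>(
            static_cast<unsigned char>(bytes[pos]) |
            (static_cast<unsigned char>(bytes[pos + 1]) << 8));
    };

    std::u32string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF)
        i = 2;                                   // skip the byte-order mark

    while (i + 1 < bytes.size()) {
        char32_t cu = unit(i);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one
        // code point above U+FFFF.
        if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < bytes.size()) {
            char32_t lo = unit(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cu = 0x10000 + ((cu - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        out += cu;
    }
    return out;
}
```

The bytes could come from a std::ifstream opened with std::ios::binary and read through std::istreambuf_iterator<char>. For the file in the question, the bytes FF FE 45 04 35 04 4B 04 decode to х, е, ы. A trailing odd byte is ignored in this sketch.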



Answer 4:

I am not sure, but you can try to call setlocale(LC_CTYPE, "");