I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8, and I need to read and parse them, for example to extract the URLs and the Chinese characters. However, when I read a file into a std::string variable and print it to the console, the Chinese characters come out as garbage. Applying boost::regex to the std::string lets me extract all the URLs, but not the Chinese characters.
How can I solve these problems?
P.S. My .cpp files are encoded as ANSI by default, and the operating system is the Chinese-language edition of Windows 8.
This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it will work with Chinese characters. Check the following links for more information: _setmode and codecvt_utf8.
If the Chinese characters don't look as expected, make sure the console is using a font that includes the required Unicode glyphs (i.e. don't use bitmap fonts).
If you need to display the characters correctly, you can use GNU libiconv to convert them. If you only need to process URLs, std::string works fine: the problem is the Windows console's code page, not the string itself. Locale behavior depends on the OS and on the C++ standard library implementation, so I don't recommend relying on it.
Windows' MultiByteToWideChar may also help, but check Microsoft's documentation for how these functions convert strings.
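To illustrate the point that the std::string itself is fine, here is a minimal sketch that pulls runs of Chinese characters straight out of the UTF-8 bytes, without touching the console or any locale. The function names (decode_utf8, is_cjk_unified, extract_chinese_runs) are made up for this example, and "Chinese character" is assumed to mean the CJK Unified Ideographs block U+4E00 to U+9FFF only; extension blocks and CJK punctuation are ignored. The decoder is deliberately simplified and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns U+FFFD for a malformed sequence. Simplified on purpose:
// overlong encodings and surrogate code points are not rejected.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                      // plain ASCII byte

    int extra = 0;                                     // continuation bytes expected
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;                                // stray continuation or invalid lead

    while (extra-- > 0) {
        if (i >= s.size()) return 0xFFFD;              // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;         // not a continuation; leave it unread
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Assumption for this sketch: only the basic CJK Unified Ideographs block counts.
bool is_cjk_unified(char32_t cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Return every run of consecutive CJK ideographs as a UTF-8 substring.
std::vector<std::string> extract_chinese_runs(const std::string& utf8) {
    std::vector<std::string> runs;
    const std::size_t none = std::string::npos;
    std::size_t run_start = none;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const std::size_t start = i;                   // byte offset of this code point
        const char32_t cp = decode_utf8(utf8, i);
        if (is_cjk_unified(cp)) {
            if (run_start == none) run_start = start;  // a new run begins here
        } else if (run_start != none) {
            runs.push_back(utf8.substr(run_start, start - run_start));
            run_start = none;
        }
    }
    if (run_start != none) runs.push_back(utf8.substr(run_start));
    return runs;
}
```

Calling extract_chinese_runs on the contents of a fetched page gives back UTF-8 substrings, so they can be written to a UTF-8 file unchanged. Displaying them in the console is a separate step that still needs a conversion to whatever encoding the console expects.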
In general, use the w variants (wstring, wfstream, wcout), set your locales to match the requirements, and hang an L on the front of string literals. locale::global(locale("")) sets things up to match the environment default; then imbue each stream that isn't running according to that default, e.g. wcout.imbue(locale("Chinese_China.936")), where "Chinese_China.936" might be Microsoft's name for your terminal's locale settings. This has always been enough to do what I want; I hope it works as well for you.