I'm writing a C++ application that can read (and change) text files while respecting the encoding used for the text. For use with other APIs, I want to explicitly convert all read text to UTF-8 internally, independent of what the actual encoding in the file was.
I am on Windows, testing with a text file saved as "ANSI" and as "UTF-8" (those seem to work correctly). But "Unicode big endian" doesn't work: the std::getline result seems to be the raw byte array, with no conversion of the file's contents (UTF-16?) to UTF-8. How can I force this conversion? I do not know beforehand what encoding the file uses. Code used:
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
    std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
Here OpenFilestreams() returns a static vector containing all opened file streams, and file_index is an index into that vector. So how can I make sure the stream reads using the correct encoding?
As for the use: I am trying to convert the result to a std::wstring using:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
return converter.from_bytes(input.c_str());
This throws a std::range_error exception. (I need a std::wstring for other Windows API functions.)
There is no way for std::getline to detect the encoding of the file by itself. You can use std::locale, imbued on the stream, to change the encoding used when reading.
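As a minimal sketch (assuming a UTF-16 file with a BOM, and using std::codecvt_utf16 from <codecvt>, which is deprecated since C++17 but still available; on Windows, where wchar_t is 16 bits, this effectively treats the text as UCS-2):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    // Open in binary mode so the facet sees the raw bytes unchanged.
    std::wifstream in("utf16be.txt", std::ios::binary);

    // consume_header makes the facet read the BOM and pick the byte
    // order from it; without a BOM, codecvt_utf16 assumes big-endian.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring line;
    while (std::getline(in, line)) {
        // line now holds the decoded text as wide characters.
    }
}

With std::consume_header the facet adapts to either byte order of UTF-16 as long as a BOM is present, which covers the "Unicode big endian" case from the question.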
Some Unicode files contain a BOM (byte order mark), which states the encoding used, but the BOM is not required.
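If you want to branch on the BOM yourself, a minimal sniffer could look like this (detectBom is a hypothetical helper; it only recognizes the UTF-8 and UTF-16 marks and reports Unknown otherwise):

#include <fstream>
#include <string>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Read up to the first three bytes and report what the BOM, if any, says.
Encoding detectBom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {};
    in.read(reinterpret_cast<char*>(b), 3);
    const std::streamsize n = in.gcount();
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return Encoding::Utf8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return Encoding::Utf16BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return Encoding::Utf16LE;
    return Encoding::Unknown;
}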
Normally, text applications use the BOM's encoding if one is present; if not, they apply heuristics to identify the encoding used. They then read the text with that encoding, normalize it to one internal representation (e.g. UTF-8), assume that representation throughout the rest of the application, and save the file back in the same encoding it was read with.
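For the normalization step, here is a sketch of converting decoded wide text to UTF-8 and back, using the same std::wstring_convert facet the question already uses:

#include <codecvt>
#include <locale>
#include <string>

// Convert decoded wide text to UTF-8 for internal use.
std::string toUtf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);
}

// And back to UTF-16 when a std::wstring is needed for Windows APIs.
std::wstring toUtf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

Note that from_bytes expects valid UTF-8 input; calling it on the raw UTF-16 bytes read from the file is what produces the std::range_error seen in the question.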
Some info about Unicode: Joel Spolsky's Unicode article.
Another article about reading Unicode encodings in C++.