I am reading text from a UTF-8 text file in chunks to improve performance.
std::ifstream.read(myChunkBuff_str, myChunkBuff_str.length())
Here is a more detailed example
I am getting around 16 thousand characters with each chunk.
My next step is to convert this std::string into something that lets me work on these "complex characters" individually, so I am converting that std::string into a std::wstring.
I am using the following function for converting, taken from here:
#include <string>
#include <codecvt>
#include <locale>

// Convert a wide string to a UTF-8 encoded narrow string.
std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}

// Convert a UTF-8 encoded narrow string to a wide string.
std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}
However, at the end of a chunk one of the Russian characters might be cut off, and the conversion will then fail with a std::range_error exception.
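For example, in UTF-8 each Cyrillic letter takes 2 bytes, so "привет" takes 12 bytes and "приве" takes 10 (15 and 13 if the file begins with a 3-byte BOM). So if my chunk hypothetically ended one byte after "приве", the 'т' would be only partially present and the conversion would throw an exception.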
Question:
How can I detect such a partially loaded character ('т' in this case)? That would let me convert the chunk without it, and perhaps start the next chunk a bit earlier than planned so that it includes the problematic 'т'.
I don't want to wrap these functions in try/catch, as exception handling might slow the program down. It also doesn't tell me how many bytes of the character were missing for the conversion to succeed.
I know about wstring_convert::converted(), but it's not really useful if my program crashes before I get to it.
You could do this using a couple of functions.
UTF-8 has a way to detect the beginning of a multibyte character and, from that first byte, the size of the multibyte character. So two functions:
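A minimal sketch of what two such functions might look like. The names is_continuation and size come from the text; the bodies are an assumption derived from the UTF-8 bit layout (continuation bytes look like 10xxxxxx, and the lead byte encodes the sequence length), not necessarily the code originally posted:

```cpp
#include <cstddef>

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx),
// i.e. it cannot be the first byte of a character.
inline bool is_continuation(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Given the FIRST byte of a UTF-8 sequence, returns the total length of
// that sequence in bytes (1 to 4), or 0 if c cannot start a sequence.
inline std::size_t size(char c)
{
    const unsigned char u = static_cast<unsigned char>(c);
    if (u < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((u & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((u & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((u & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // continuation byte or invalid lead byte
}
```

Both functions cast to unsigned char first, because plain char may be signed and bytes above 0x7F would otherwise compare as negative values.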
You could track back from the end of your buffer until is_continuation(c) is false. Then check whether size(c) of the current UTF-8 char extends past the end of the buffer.

Disclaimer: last time I looked these functions were working, but I have not used them in a while.
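The track-back step could be sketched roughly as follows. The function name complete_prefix_length is hypothetical, and the sketch assumes the chunk is otherwise well-formed UTF-8; it only looks at the last few bytes and leaves malformed input for the converter to report:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many leading bytes of `chunk` consist of
// whole UTF-8 characters. If the last character is cut off, the returned
// length stops just before its lead byte; otherwise it equals chunk.size().
inline std::size_t complete_prefix_length(const std::string& chunk)
{
    const std::size_t n = chunk.size();
    std::size_t start = n;
    // A UTF-8 character is at most 4 bytes, so look back at most 4 bytes.
    for (std::size_t steps = 0; start > 0 && steps < 4; ++steps) {
        --start;
        const unsigned char u = static_cast<unsigned char>(chunk[start]);
        if ((u & 0xC0) == 0x80)
            continue;                       // continuation byte: keep walking back
        // Found a lead byte: work out how long its sequence should be.
        std::size_t need = 0;
        if (u < 0x80)                need = 1;
        else if ((u & 0xE0) == 0xC0) need = 2;
        else if ((u & 0xF0) == 0xE0) need = 3;
        else if ((u & 0xF8) == 0xF0) need = 4;
        if (need == 0)
            return n;                       // invalid lead byte: do not trim
        return (start + need > n) ? start : n;
    }
    return n;                               // empty, or malformed tail: do not trim
}
```

The bytes from the returned length up to chunk.size() are the partial character; they can be carried over and prepended to the next chunk before converting it.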
Edit: to add.
If you feel like doing the whole thing manually, I may as well post the code to convert a UTF-8 multibyte character to a UTF-16 multibyte or a UTF-32 char.

UTF-32 is easy:
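A minimal sketch of such a decoder, assuming the caller passes one well-formed sequence and its length (for example, the length derived from the lead byte). The name utf8_to_utf32 is hypothetical, and no validation of continuation bytes, overlong forms, or surrogates is attempted:

```cpp
#include <cstddef>

// Hypothetical helper: decodes one UTF-8 sequence of n bytes (1 to 4)
// starting at s into a UTF-32 code point. Assumes well-formed input.
inline char32_t utf8_to_utf32(const char* s, std::size_t n)
{
    if (n < 1 || n > 4)
        return 0xFFFD; // U+FFFD REPLACEMENT CHARACTER for an impossible length

    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    // Masks that strip the length marker bits from the lead byte.
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};

    char32_t cp = p[0] & lead_mask[n];
    // Each continuation byte contributes its low 6 bits.
    for (std::size_t i = 1; i < n; ++i)
        cp = (cp << 6) | (p[i] & 0x3F);
    return cp;
}
```

For instance, the two bytes of 'т' (0xD1 0x82) decode to U+0442.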
UTF-16 is a little more tricky:
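The extra difficulty is that code points above U+FFFF need a surrogate pair. A minimal sketch of that step, assuming the UTF-8 bytes have already been decoded to a code point; the name utf32_to_utf16 is hypothetical, and the input is assumed to be a valid code point (at most U+10FFFF and not itself a surrogate):

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16.
// Returns one code unit for the Basic Multilingual Plane,
// or a high/low surrogate pair for anything above U+FFFF.
inline std::u16string utf32_to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low 10 bits
    }
    return out;
}
```

For instance, U+0442 ('т') stays a single unit, while U+1F600 becomes the pair 0xD83D 0xDE00.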