I have a string in UTF-8 encoding. I can read it from a file and write it into another file just fine, but when I try to process the characters of that string one by one, the result isn't coherent. I'm most likely doing this in a very wrong way; what is the correct way to do it?
The content of source.txt is
afternoon_gb_1 ɑftənun
The code I wrote is
while (source >> word >> word_ipa) {
    for (char& c : word_ipa)
        myfile << word << " is " << c << endl;
}
The content that gets written to the txt file myfile is
afternoon_gb_1 is �
afternoon_gb_1 is �
afternoon_gb_1 is f
afternoon_gb_1 is t
afternoon_gb_1 is �
afternoon_gb_1 is �
afternoon_gb_1 is n
afternoon_gb_1 is u
afternoon_gb_1 is n
In UTF-8 each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is (ch: character; c.p.: code point number; c.u.: code unit representation in UTF-8; c.p. and c.u. are expressed in hexadecimal):

ch  c.p.  c.u.
ɑ   0251  c9 91
f   0066  66
t   0074  74
ə   0259  c9 99
n   006e  6e
u   0075  75
n   006e  6e
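The exact details of how code points are mapped to code units are explained in many places; the very basics are that code points up to 7f are stored as a single code unit with the high bit clear, while every other code point is stored as a sequence of two to four code units that all have the high bit set: a leading one, whose top bits declare the length of the sequence, followed by continuation code units whose top two bits are 10.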
If you print out each code unit on its own, you break the UTF-8 encoding of the code points that require more than one code unit. In the first row, your terminal application sees c9 followed by a newline (the first code unit, then the line break) and immediately detects that this is a broken UTF-8 sequence, as c9 has the high bit set but the next c.u. doesn't have it; hence the � character. The same holds for the second row, where 91 is a continuation c.u. with no leading c.u. before it, as well as for the two c.u. of the sequence representing ə.
Now, if you want to print out full code points (not code units), std::string won't be of any help: std::string knows nothing about this stuff; it is essentially a glorified std::vector<char>, completely oblivious of encoding issues; all it does is store/index code units, not code points.

There are, however, third-party libraries that help with this; utf8-cpp is a small but complete one; in your case, the utf8::next function would be particularly helpful: utf8::next just advances the given iterator to make it point to the code unit that starts the next code point; calling it in a loop and printing everything between the old and the new iterator position makes sure that we print together all the code units that make up a single code point.

Notice that we can reproduce its barebones behavior quite simply; it's just a matter of reading the UTF-8 specs (see the first table in the Wikipedia article on UTF-8):
Here we are exploiting the fact that the first byte of a sequence declares how many extra code units are going to come to complete the code point.
(Notice that this expects valid UTF-8 and makes no attempt to resynchronize after a broken UTF-8 sequence; the library version probably fares way better in this regard.)
OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:
Here instead we are completely disregarding the "declared count" in the first c.u.; we just check whether its high bit is set. If it is, we go on printing as long as we get c.u. with the top two bits set to 10 (in binary, AKA 2 in decimal), since the "continuation c.u." of a multi-c.u. UTF-8 sequence all follow this pattern.