I want to read a Unicode (UTF-8) file character by character, but I don't know how to read a file one character at a time.
Can anyone tell me how to do that?
EDIT: I want to read the file one letter at a time.
In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returns -1 for the first byte of a multibyte character and keeps returning -1 after that. So I wrote the following:
It took less time than researching how mblen should be used.
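As a rough illustration of sidestepping mblen (whose result depends on the current locale), the length of a UTF-8 sequence can be worked out from its first byte alone. The sketch below is not the code from this answer; the names utf8_sequence_length and read_utf8_char are hypothetical, and it assumes the input is a byte stream opened with std::ifstream.

```cpp
#include <fstream>
#include <istream>
#include <string>

// Number of bytes in a UTF-8 sequence, judged from its first byte.
// Returns 0 for a byte that cannot start a sequence (a continuation
// byte or an invalid lead byte).
int utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Reads the next UTF-8 character from `in` into `out` as raw bytes.
// Returns false at end of input or on a malformed or truncated sequence.
bool read_utf8_char(std::istream& in, std::string& out) {
    out.clear();
    char c;
    if (!in.get(c)) return false;
    int len = utf8_sequence_length(static_cast<unsigned char>(c));
    if (len == 0) return false;
    out.push_back(c);
    for (int i = 1; i < len; ++i) {
        if (!in.get(c)) return false;
        // Every following byte must look like 10xxxxxx.
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) return false;
        out.push_back(c);
    }
    return true;
}
```

Calling read_utf8_char in a loop on a std::ifstream opened with std::ios::binary yields one character per call, each still encoded as UTF-8 bytes.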
First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description
Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, check it against that table:
(Row 1) If the most significant bit is 0, i.e. (c & 0x80) == 0, you have your character.
(Row 2) If the three most significant bits are 110, i.e. (c & 0xE0) == 0xC0, you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with the shift and bitwise operators of C/C++.
And so on for the remaining rows.
Note that the parentheses around the mask are required: in C and C++, == binds more tightly than &.
It sounds a bit complicated, but it's not difficult once you know how to shift the bits into the proper place to decode a UTF-8 string.
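The rows described above can be sketched roughly as follows. This is an illustrative decoder, not a complete validator: the function name decode_utf8 is made up for this example, it assumes pos is a valid index into the string, and it does not reject overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <string>

// Decodes one UTF-8 sequence starting at s[pos] and advances pos past
// the bytes it consumed. Returns the code point, or U+FFFD (the
// replacement character) if the bytes are not a well-formed sequence.
// Precondition: pos < s.size().
char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    int extra;      // number of continuation bytes still to read
    char32_t cp;    // code point bits collected so far
    if ((b0 & 0x80) == 0x00)      { extra = 0; cp = b0; }         // row 1
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }  // row 2
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }  // row 3
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }  // row 4
    else { ++pos; return 0xFFFD; }  // stray continuation or invalid byte
    ++pos;
    for (int i = 0; i < extra; ++i) {
        if (pos >= s.size()) return 0xFFFD;             // truncated input
        unsigned char b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;          // not 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);  // append the 6 payload bits
        ++pos;
    }
    return cp;
}
```

Calling it repeatedly while pos < s.size() walks the string one character at a time.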
UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:
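One common way to do this, shown here as a sketch rather than the exact code this answer used, is to stream the file buffer into a std::ostringstream. The helper name read_file is hypothetical.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file at `path` into a string, one char per byte.
// Returns an empty string if the file cannot be opened.
std::string read_file(const std::string& path) {
    std::ifstream in(path);    // default (text) mode, as for an ASCII file
    std::ostringstream buffer;
    buffer << in.rdbuf();      // copy everything from the file buffer
    return buffer.str();
}
```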
The resulting string has characters corresponding to UTF-8 bytes. You could loop through it like so:
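A minimal sketch of such a loop, assuming the text is already in a std::string; the function name byte_values is invented for this example. Each element visited is one byte, not necessarily one whole character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects the numeric value of every byte in `text`, in order.
std::vector<unsigned> byte_values(const std::string& text) {
    std::vector<unsigned> values;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Cast through unsigned char: plain char may be signed, which
        // would turn bytes >= 0x80 into negative numbers.
        values.push_back(static_cast<unsigned char>(text[i]));
    }
    return values;
}
```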
Alternatively, you could open the file in binary mode, and then move through each byte that way:
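A sketch of the binary-mode variant, assuming a plain std::ifstream; the name read_bytes is hypothetical. Binary mode avoids any newline translation, so the bytes arrive exactly as stored on disk.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Reads the file at `path` one byte at a time in binary mode.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes;
    char c;
    while (in.get(c)) {  // get() fails at end of file, ending the loop
        bytes.push_back(static_cast<unsigned char>(c));
    }
    return bytes;
}
```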
If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.
Try this: read the file, then loop through the text based on its length.
Pseudocode:
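One possible reading of that idea in C++, offered as a sketch: loop over the text by its byte length and count only the bytes that start a character, skipping continuation bytes of the form 10xxxxxx. The name count_code_points is made up here, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts the characters (code points) in a UTF-8 string by looping
// over its bytes and skipping continuation bytes.
std::size_t count_code_points(const std::string& text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.length(); ++i) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if ((b & 0xC0) != 0x80) {  // not 10xxxxxx, so a new character starts
            ++count;
        }
    }
    return count;
}
```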