Reading Unicode characters

Posted 2019-05-31 17:28

Question:

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read characters from a file one at a time.

Can anyone tell me how to do that?

EDIT: I want to read the file one letter at a time.

Answer 1:

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description

Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, check it against that table:

(Row 1) If the most significant bit is 0 ((char & 0x80) == 0), that byte is your character. Note the extra parentheses: in C/C++, == binds more tightly than &.

(Row 2) If the three most significant bits are 110 ((char & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, and 2 of the first UTF-8 byte (110YYYyy) form the high byte of the Unicode character (00000YYY); its two least significant bits, together with the six least significant bits of the next byte (10xxxxxx), form the low byte (yyxxxxxx). You can do the bit arithmetic easily with the shift and bitwise operators of C/C++:

UnicodeByte1 =   (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);

And so on...

It sounds a bit complicated, but it's not difficult once you know how to move the bits into the proper place to decode a UTF-8 string.
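The rows of that table can be put together into one decoding loop. The following is a minimal sketch, not a full validator: `decode_utf8` is a hypothetical helper name, it assumes the whole input is already in a `std::string`, and it substitutes U+FFFD for malformed sequences without rejecting overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into Unicode code points.
// Malformed or truncated sequences yield U+FFFD. Overlong encodings and
// surrogates are NOT rejected in this sketch.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if ((b0 & 0x80) == 0x00)      { cp = b0;        extra = 0; }  // 0xxxxxxx
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }  // 110xxxxx
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }  // 1110xxxx
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }  // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }  // stray or invalid byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }  // truncated input
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);  // append six payload bits
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Given the bytes 0xC3 0xA9, the function takes the five payload bits of the first byte and the six of the second and produces the single code point U+00E9.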



Answer 2:

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:

#include <iostream>
#include <string>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator

std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)),
             std::istreambuf_iterator<char>());

The resulting string holds one char per UTF-8 byte, so a character outside ASCII spans several elements. You could loop through it like so:

for(std::string::iterator i = content.begin();
    i != content.end();
    ++i)
{
    char nextChar = *i;
    // do stuff here.
}
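Because that loop visits bytes rather than characters, it helps to be able to tell where a character starts. As a small sketch (the name `count_code_points` is hypothetical, and the input is assumed to be valid UTF-8), every byte that is not a continuation byte of the form 10xxxxxx begins a new character:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in a string that is assumed
// to hold valid UTF-8, by counting every byte that is not a continuation
// byte (10xxxxxx).
std::size_t count_code_points(const std::string& content) {
    std::size_t n = 0;
    for (unsigned char byte : content) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: a new character starts here
        }
    }
    return n;
}
```

The same test can be used inside the loop to decide when the bytes collected so far form a complete character.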

Alternatively, you could open the file in binary mode, and then move through each byte that way:

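std::ifstream fs("my_file.txt", std::ifstream::binary);
if(fs.is_open())
{
    char nextChar;
    // get() reads raw bytes; operator>> would skip whitespace, and testing
    // fs.good() before reading would process the last byte twice.
    while(fs.get(nextChar))
    {
        // do stuff here.
    }
}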

If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.

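QFile file("my_file.text");
if(file.open(QIODevice::ReadOnly | QIODevice::Text))
{
    QTextStream in(&file);
    in.setCodec("UTF-8");  // Qt 5 API; Qt 6 uses setEncoding(QStringConverter::Utf8)
    QString contents = in.readAll();
    // use contents here
}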


Answer 3:

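In theory, stdlib.h has a function, mblen, which should return the length of a multibyte character. But in my case it returned -1 for the first byte of a multibyte character and kept returning -1 from then on. (mblen interprets bytes according to the current locale, so this is likely what happens when no UTF-8 locale has been set.) So I wrote the following: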

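// utf8_len is a hypothetical name; the original snippet omitted the signature.
// Returns the byte length of the UTF-8 character starting at i_ch, or -1.
int utf8_len(const char* i_ch)
{
    if(i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int mask = 0x80;
    // Count the leading 1 bits of the first byte.
    while(ch & mask) {
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;            // 0xxxxxxx: single-byte (ASCII) character
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;
}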

This took less time than researching how mblen is supposed to be used.
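For comparison, here is a sketch of how `mblen` could be used to walk a string. It rests on assumptions: `multibyte_lengths` is a hypothetical helper, and `mblen` only recognizes UTF-8 sequences if the program has first selected a UTF-8 locale, which may not be available on every system.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: walks `s` with std::mblen and returns the byte length
// of each multibyte character. Stops at the first sequence that is invalid
// in the current locale.
std::vector<int> multibyte_lengths(const char* s) {
    std::vector<int> lengths;
    std::mblen(nullptr, 0);  // reset any internal shift state
    std::size_t remaining = std::strlen(s);
    while (remaining > 0) {
        int n = std::mblen(s, remaining);
        if (n <= 0) break;  // invalid sequence for this locale
        lengths.push_back(n);
        s += n;
        remaining -= static_cast<std::size_t>(n);
    }
    return lengths;
}
```

A program starts in the "C" locale, so the caller would typically run something like `std::setlocale(LC_CTYPE, "")` from `<clocale>` beforehand, assuming the environment is configured for UTF-8. Without that, non-ASCII bytes may be rejected, which matches the -1 results described above.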



Answer 4:

Try this: read the file into a string, then loop through the text based on its length.

Pseudocode:

String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
    String the_character = s[i];

    // TODO : Do your thing :o)
}