Getting the length of a UTF-8 string in C? [closed]

Posted 2019-09-20 18:21

Can this be done using a method similar to this one:

While the current element of the string the user entered via scanf is not '\0', add one to a "length" int, and then print the length once the loop ends.

I would be very grateful if anybody could guide me through the least complex way possible as I am a beginner.

Thank you very much, have a good one!

Tags: c string utf-8
1 answer
我想做一个坏孩纸
#2 · 2019-09-20 18:57

What do you mean by string length?

The number of bytes is easily obtained with strlen(s).
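
For comparison, the loop described in the question can be sketched as below. The name byte_length is made up for this illustration; the function should return the same value as strlen(s), which is a count of bytes, not of characters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name chosen for this sketch): counts bytes up to
   the terminating '\0', exactly the loop described in the question.
   It is equivalent to strlen(s). */
size_t byte_length(const char *s) {
    size_t length = 0;
    while (s[length] != '\0') {
        length++;
    }
    return length;
}
```

For plain ASCII input the byte count and the character count agree; for text such as "héllo", where é takes two bytes in UTF-8, the byte count is one larger than the number of letters.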

The number of code points encoded in UTF-8 can be computed by counting the single-byte characters (range 0x01 to 0x7F) and the leading bytes of multi-byte sequences (range 0xC0 to 0xFF; well-formed UTF-8 only uses 0xC2 to 0xF4), ignoring continuation bytes (range 0x80 to 0xBF), and stopping at '\0'.

Here is a simple function to do this:

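#include <stddef.h>  /* size_t */

/* Count UTF-8 code points: every byte except continuation bytes (10xxxxxx). */
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        /* (byte & 0xC0) == 0x80 identifies a continuation byte; skip those */
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}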

This function assumes that the contents of the array pointed to by s are properly encoded.
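
If the input cannot be trusted, a checked variant is one option. The following is only a sketch: the name count_utf8_code_points_checked and the (size_t)-1 error value are assumptions made for this example, and it verifies only the lead/continuation byte structure. It does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the answer above): counts code points,
   returning (size_t)-1 if the byte structure is malformed.
   Structural check only; overlong forms and surrogates are NOT rejected. */
size_t count_utf8_code_points_checked(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p) {
        int extra;  /* number of continuation bytes expected after the lead */

        if (*p < 0x80) {
            extra = 0;                      /* 0xxxxxxx: single byte */
        } else if ((*p & 0xE0) == 0xC0) {
            extra = 1;                      /* 110xxxxx */
        } else if ((*p & 0xF0) == 0xE0) {
            extra = 2;                      /* 1110xxxx */
        } else if ((*p & 0xF8) == 0xF0) {
            extra = 3;                      /* 11110xxx */
        } else {
            return (size_t)-1;              /* stray continuation or invalid lead */
        }
        p++;

        while (extra-- > 0) {
            /* A '\0' here also fails the test, so we never read past the end. */
            if ((*p & 0xC0) != 0x80) {
                return (size_t)-1;
            }
            p++;
        }
        count++;
    }
    return count;
}
```

A caller would compare the result against (size_t)-1 before using it as a length.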

Also note that this will compute the number of code points, not the number of characters displayed, as some of these may be encoded using multiple combining code points, such as <LATIN CAPITAL LETTER A> followed by <COMBINING ACUTE ACCENT>.
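
To make the difference concrete, here is a small sketch with the same visible letter written two ways. The array names precomposed and decomposed are invented for this example, and the strings use byte escapes so the result does not depend on the source file's encoding.

```c
#include <stddef.h>

/* Same counting rule as the function above, repeated so this
   snippet compiles on its own. */
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

/* Hypothetical sample data: both strings display as a single "Á". */
static const char precomposed[] = "\xC3\x81";   /* U+00C1, 2 bytes */
static const char decomposed[]  = "A\xCC\x81";  /* U+0041 + U+0301, 3 bytes */
```

count_utf8_code_points(precomposed) gives 1 and count_utf8_code_points(decomposed) gives 2, although both render as one character. Counting displayed characters requires grapheme cluster segmentation, which is usually left to a Unicode library.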
