C++ UTF-8 actual string length

Posted 2019-04-10 17:36

Is there a native, cross-platform function in any of the C++ standard libraries that returns the actual length of a std::string?

Update: as we know, std::string::length() returns the number of bytes, not the number of characters. I already have a custom function that returns the actual character count, but I'm looking for a standard one.

3 answers
贼婆χ
#2 · 2019-04-10 18:12

The actual length is the number of bytes. Counting code points is rarely meaningful. You may, however, want to count other things, such as grapheme clusters.

See http://utf8everywhere.org for more about the different kinds of string length.
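To make the difference between bytes and code points concrete, here is a minimal sketch of a hand-written counter. The name count_code_points is hypothetical, and the function assumes its input is already valid UTF-8; it performs no validation and says nothing about grapheme clusters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded string.
// Assumption: the input is valid UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx. Every other byte
        // begins a new code point, so only those are counted.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a pure ASCII string the result equals size(); for a string containing multi-byte sequences it is smaller than size().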

Anthone
#3 · 2019-04-10 18:19

codecvt ought to be helpful. The Standard provides implementations for UTF-8; for example, codecvt_utf8<char32_t> would be appropriate in this case.

Probably something like this (with <locale> and <codecvt> included):

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>().from_bytes(the_std_string).size()
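A fuller sketch of the one-liner above, wrapped in a function. The name utf8_code_point_count is hypothetical. Note that std::wstring_convert and the <codecvt> header have been deprecated since C++17, so compilers may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical wrapper around the wstring_convert approach.
// Returns the number of code points, not grapheme clusters.
std::size_t utf8_code_point_count(const std::string& the_std_string) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    // from_bytes throws std::range_error if the input is not valid UTF-8,
    // because no error string was supplied to the converter.
    std::u32string code_points = converter.from_bytes(the_std_string);
    return code_points.size();
}
```

Unlike a simple byte-pattern count, this converts the whole string, so it allocates a temporary std::u32string and rejects malformed input.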
手持菜刀,她持情操
#4 · 2019-04-10 18:20

There is no way to do that in C or C++ without third-party libraries. Even if you convert to char32_t, you will get code points, not characters.

A code point does not match the user's perception of a character, because of things like decomposed forms, ligatures, and variation selectors.
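As a small illustration of the decomposed-form case, the two strings below both render as "é", yet they hold different numbers of code points. The variable names are chosen for this example only.

```cpp
#include <string>

// Two encodings of the same user-perceived character "é".
// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::u32string precomposed = U"\u00E9";
// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string decomposed = U"e\u0301";
```

Here precomposed.size() is 1 and decomposed.size() is 2, and the two strings do not compare equal without normalization.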

The closest available construct to a "user character" is a "grapheme cluster" (see http://www.unicode.org/reports/tr29/).

Your best cross-platform option is ICU4C (http://site.icu-project.org/).
