how can I compare utf8 string such as persian word

2019-01-20 03:45发布

问题:

I want to compare strings in Persian (utf8). I know I must use some thing like L"گل" and it must be saved in wchar_t * or wstring. the question is when I compare by the function compare() strings I dont get the right result.

回答1:

If the strings that you want to compare are in a specific, definite encoding already, then don't use wchar_t and don't use L"" literals -- those are not for Unicode, but for implementation-defined, opaque encodings only.

If your strings are in UTF-8, use a string of chars. If you want to convert them to raw Unicode codepoints (UCS-4/UTF-32), or if you already have them in that form, store them in a string of uint32_ts, or char32_ts if you have a modern compiler.

If you have C++11, your literal can be char str8[] = u8"گل"; or char32_t str32[] = U"گل";. See this topic for some more on this.

If you want to interact with command line arguments or the environment, use iconv() to convert from WCHAR to UTF-32 or UTF-8.



回答2:

wchar_t is not for UTF-8, but (depending on the platform) typically either UTF-16 or UCS-32. If you want to work on UTF-8, use plain old char * or string, and their comparison functions for equality. If you want human-meaingful sorting, it gets much more involved (no matter which encoding you use).



回答3:

Unicode is notoriously difficult to compare.

Note that any Unicode encoding, including UTF-8, 16 or 32 cannot be compared byte-wise for anything other than byte-equality. The display may be identical, but the bytes used (such as R->L markers, surrogate pairs, display modifiers, and similar used in non-English languages such as Persian) will not be.

Generally, you need to normalize Unicode before you can make a realistic comparison if the meaning of the text has any significance:

http://userguide.icu-project.org/transforms/normalization