Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:
    UnicodeString strip_diacritics( UnicodeString const &s ) {
        UnicodeString result;
        // ...
        return result;
    }
Assume that s has already been normalized. Thanks.
After more searching elsewhere:
which is O(n).
ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC: decompose, remove diacritics, recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string: