Two related questions.
Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two or more combined symbols (like p̄ and ḏ̣). This little code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
I can't say this is better or faster, but I strip diacritics in this way:
This is the best I was able to come up with from the docs; there might be a simpler way, but I'm not sure.
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. Uni.new(...).Str then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
will return a list of 1-character strings, each with a single decomposed codepoint, or
will make a nice little Unicode debugger.
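As a minimal sketch of the steps described above (not necessarily the exact code from this answer), the decompose / filter / reassemble pipeline could look like the following. The variable names $in, $stripped and @parts are placeholders chosen for illustration, and the sample input is an assumption.

```raku
# Hypothetical input: "ä" followed by "p" + COMBINING MACRON (U+0304).
my $in = "ä" ~ "p" ~ 0x304.chr;

# Decompose to NFD, drop every codepoint that has the Grapheme_Extend
# property (the combining marks), then reassemble the rest into a Str.
my $stripped = Uni.new($in.NFD.grep({ !uniprop($_, 'Grapheme_Extend') })).Str;
say $stripped;      # ap

# One string per decomposed codepoint (base letters and marks alike).
my @parts = $in.NFD.map({ .chr });
say @parts.elems;   # 4

# Print the Unicode name of every decomposed codepoint.
.say for $in.NFD.map({ uniname($_) });
```

The last loop is the "debugger" variant: it lists the name of each base letter and each combining mark on its own line.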
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
This can be used both to remove marks/diacritics from letters, as well as to add them.
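A short sketch of how samemark can strip marks, assuming a plain unaccented "a" as the pattern. When the string is longer than the pattern, the remaining characters take the marks of the pattern's last character, so a one-letter pattern is enough here.

```raku
# The pattern "a" carries no marks, so every character loses its marks.
say "ä".samemark("a");     # a
say "åäö".samemark("a");   # aao
```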
For (2), there are a few ways to do this (TIMTOWTDI). If you want a list of all the codepoints in a string, you can use the ords method to get a List (technically a Positional) of all the codepoints in the string.
You can use the uniname method/routine to get the Unicode name for a codepoint:
or just use the uninames method/routine:
If you just want the number of codepoints in the string, you can use codes:
This is different than chars, which just counts the number of characters in the string:
Also see @hobbs' answer using NFD.
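To make the difference between these methods concrete, here is a small sketch using the p̄ example from the question. The variable name $s is a placeholder. Note that these methods work on the string as stored, so a precomposed character such as ä (U+00E4) still counts as a single codepoint unless you go through NFD first.

```raku
# "p" followed by COMBINING MACRON (U+0304); there is no precomposed form.
my $s = "p" ~ 0x304.chr;

say $s.ords.join(" ");       # 112 772
say $s.uninames.join(", ");  # LATIN SMALL LETTER P, COMBINING MACRON
say $s.codes;                # 2
say $s.chars;                # 1
```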