
How to remove diacritics in Perl 6

Posted 2019-04-29 00:26

Question:

Two related questions. Perl 6 is smart enough to treat a grapheme as one character, whether it consists of one Unicode codepoint (like ä, U+00E4) or of two or more combined codepoints (like p̄ and ḏ̣). This short snippet

my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄" 
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;

gives the following output:

ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character

But sometimes I would like to be able to do the following. 1) Remove diacritics from ä. So I need some method like

"ä".mymethod → "a"

2) Split "combined" symbols into their parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:

$ echo p̄ | grep . -o | wc -l
2

Answer 1:

Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.

Per the documentation:

multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method    samemark(Str:D: Str:D $pattern --> Str:D)

Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.

Examples:

say 'åäö'.samemark('aäo');                        # OUTPUT: «aäo␤» 
say 'åäö'.samemark('a');                          # OUTPUT: «aao␤» 

say samemark('Pêrl', 'a');                        # OUTPUT: «Perl␤» 
say samemark('aöä', '');                          # OUTPUT: «aöä␤» 

This can be used both to remove marks/diacritics from letters, as well as to add them.
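As a minimal sketch for question (1): because the last character of the pattern is applied to all remaining characters, a single unaccented letter as the pattern strips the marks from a whole string. The helper name strip-marks is hypothetical and only wraps the documented samemark method.

```raku
# Hypothetical helper (not part of Perl 6): strip marks via samemark.
sub strip-marks(Str:D $s --> Str:D) {
    # 'a' carries no marks, and the last pattern character is reused
    # for every remaining character of $s.
    $s.samemark('a')
}

say strip-marks('ä');       # OUTPUT: «a␤»
say strip-marks('Pêrl');    # OUTPUT: «Perl␤»
```

Characters that have no decomposition into base letter plus mark are left as they are.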

For (2), there is more than one way to do it (TIMTOWTDI). To list every codepoint in a string, use the ords method, which returns the codepoints as integers in a List (technically a Positional).

say "p̄".ords;                  # OUTPUT: «(112 772)␤»

You can use the uniname method/routine to get the Unicode name for a codepoint:

.uniname.say for "p̄".ords;     # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

or just use the uninames method/routine:

.say for "p̄".uninames;         # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

If you just want the number of codepoints in the string, you can use codes:

say "p̄".codes;                 # OUTPUT: «2␤»

This differs from chars, which counts the characters (graphemes) in the string:

say "p̄".chars;                 # OUTPUT: «1␤»
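To get the parts themselves as strings rather than as numbers or names, one possible sketch maps each codepoint back through chr. The variable names here are illustrative only. Note that ords reports the composed form of the string, so a precomposed character such as ä still comes back as a single codepoint; splitting that one requires a decomposition step such as NFD.

```raku
# Build "p̄" from its codepoints, as in the question.
my $grapheme = "p" ~ 0x304.chr;

# Turn every codepoint back into a one-codepoint string.
my @parts = $grapheme.ords.map(*.chr);

say @parts.elems;          # OUTPUT: «2␤»
say @parts[0];             # OUTPUT: «p␤»
say @parts[1].uniname;     # OUTPUT: «COMBINING MACRON␤»

# A precomposed character is not split by ords:
say 0xE4.chr.ords.elems;   # OUTPUT: «1␤»
```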

Also see @hobbs' answer using NFD.



Answer 2:

This is the best I could come up with from the docs; there might be a simpler way, but I'm not sure.

my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino

The .NFD method converts the string to Normalization Form D (decomposed), which separates graphemes into base codepoints and combining codepoints wherever possible. The grep then keeps only the codepoints that lack the "Grapheme_Extend" property, i.e. it removes the combining codepoints. Finally, Uni.new(...).Str assembles the remaining codepoints back into a string.

You can also put these pieces together to answer your second question; e.g.:

$in.NFD.map: { Uni.new($_).Str }

will return a list of 1-character strings, each with a single decomposed codepoint, or

$in.NFD.map(&uniname).join("\n")

will make a nice little Unicode debugger.



Answer 3:

I can't say this is better or faster, but I strip diacritics in this way:

my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"
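To see why this works, here is a sketch that prints each intermediate step. The helper name trace-strip is hypothetical. It assumes that, for typical accented letters, the first codepoint of the decomposed grapheme is the base letter; everything after it is discarded.

```raku
# Hypothetical helper that traces the comb / NFD / chr pipeline.
sub trace-strip(Str:D $s --> Str:D) {
    my @out;
    for $s.comb -> $grapheme {
        # Decompose one grapheme into its codepoints.
        my @codes = $grapheme.NFD.list;
        say $grapheme, ': ', @codes.map(*.fmt('U+%04X')).join(' ');
        # Keep only the first codepoint and drop the rest.
        @out.push: @codes[0].chr;
    }
    @out.join
}

say trace-strip("åäö");
```

For the input above, the last line printed is aao, matching the one-liner.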


Tags: perl6