When I use locale, some characters from my locale (et_EE.UTF-8) are not matched by \w, and I don't see why.
In addition to ASCII, Estonian uses six more characters:
õäöüšž
In my test script below I use them in $string together with three additional characters, ðŋц, which do not belong to the Estonian alphabet.
use feature 'say';
use POSIX qw( locale_h );
{
use utf8;
my $string = "õäöüšž ðŋц";
binmode STDOUT, ":encoding(UTF-8)";
say "nothing";
say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
say 'UC: ', uc( $string );
say 'SORT: ', sort( split(//, $string) );
say $string =~ m/\w/g;
say $string =~ m/\p{Word}/g;
say '';
}
{
use utf8;
use locale;
binmode STDOUT, ":encoding(UTF-8)";
my $string = "õäöüšž ðŋц";
say "locale";
say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
say 'UC: ', uc( $string );
say 'SORT: ', sort( split(//, $string) );
say $string =~ m/\w/g;
say $string =~ m/\p{Word}/g;
say '';
}
{
use utf8::all;
my $string = "õäöüšž ðŋц";
say "utf8::all";
say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
say 'UC: ', uc( $string );
say 'SORT: ', sort( split(//, $string) );
say $string =~ m/\w/g;
say $string =~ m/\p{Word}/g;
say '';
}
{
use utf8::all;
use locale;
my $string = "õäöüšž ðŋц";
say "utf8::all + locale";
say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
say 'UC: ', uc( $string );
say 'SORT: ', sort( split(//, $string) );
say $string =~ m/\w/g;
say $string =~ m/\p{Word}/g;
say '';
}
I tried with Perl 5.10.1 and 5.14.2, and both gave me this output:
nothing
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT: äðõöüŋšžц
õäöüšžðŋц
õäöüšžðŋц
locale
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT: ðŋšžõäöüц
šžŋц
õäöüšžðŋц
utf8::all
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT: äðõöüŋšžц
õäöüšžðŋц
õäöüšžðŋц
utf8::all + locale
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT: ðŋšžõäöüц
šžŋц
õäöüšžðŋц
What did not behave as I expected?
- Main problem: under use locale I hoped \w would match all six of my characters, but the result šžŋц is quite weird. Why these matches? In perlrecharclass I read:
For code points above 255 ... \w matches the same as \p{Word} matches in this range. ... For code points below 256 ... if locale rules are in effect ... \w matches the platform's native underscore character plus whatever the locale considers to be alphanumeric.
So \w matches the characters above 255 here, but does not match "whatever the locale considers to be alphanumeric". Why? At the same time, sorting under locale works fine (and without locale it does not): the result ðŋšžõäöüц is in the right order, which shows that the characters are represented properly. AFAIU, sort could not work correctly without knowing what "the locale considers to be alphanumeric". Or could it?
- I thought that setlocale returns a result only under the locale pragma. How can I test which locale is in effect for a given scope?
- I did not expect all characters to be upper-cased in every test case. AFAIU, uc and lc should be locale-dependent. In the first case I thought they would all stay lower-case, and with use locale I expected the first six characters to be upper-cased and the others not. The only case where I expected all characters to be upper-cased was the third. I see I am missing something important here. Oops, now I found this in the lc docs: "Otherwise, If EXPR has the UTF-8 flag set: Unicode semantics are used for the case change." The UTF-8 flag is always set on my $string, so this point got answered while I was writing it.
Using locale for sorting and \p{Word} for matching is acceptable for me, but I could still use some hints: why does \w not work as I expected?
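
As a cross-check against the perlrecharclass passage quoted above, here is a minimal sketch that splits the letters of $string by code point, since that passage treats code points below 256 and above 255 differently. It uses only ord and the same $string as the test script; the names @below_256 and @above_255 are my own, chosen for illustration.

```perl
use strict;
use warnings;
use utf8;    # the source literal below is UTF-8

binmode STDOUT, ':encoding(UTF-8)';

my $string = "õäöüšž ðŋц";

# Group every non-space character by whether its code point is below 256,
# which is the boundary the perlrecharclass text draws for \w under locale.
my ( @below_256, @above_255 );
for my $char ( grep { $_ ne ' ' } split //, $string ) {
    if ( ord($char) < 256 ) {
        push @below_256, $char;
    }
    else {
        push @above_255, $char;
    }
}

# Show each group together with the code points it contains.
printf "below 256: %s (%s)\n", join( '', @below_256 ),
    join( ' ', map { sprintf 'U+%04X', ord } @below_256 );
printf "above 255: %s (%s)\n", join( '', @above_255 ),
    join( ' ', map { sprintf 'U+%04X', ord } @above_255 );
```

The above-255 group is šžŋц, which is exactly what \w matched under use locale. So the characters that fail are precisely the ones below 256 (õäöü and ð), the range where, according to the quoted passage, the locale's own classification is consulted. This only narrows the question down; it does not explain why those characters are rejected.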