Using “sort” for utf8 strings in Perl

2019-07-29 05:39发布

I'm trying to figure out how to sort an array alphabetically in Perl. Here is what I have that works fine in english:

   # List of countries (kept like this to keep clean, as its re-used in other places)
    my $countries = {
        'AT' => "íAustria",
        'AU' => "Australia",
        'BE' => "Belgium",
        'BG' => "Bulgaria",
        'CA' => "Canada",
        'CY' => "Cyprus",
        'CZ' => "Czech Republic",
        'DK' => "Denmark",
        'EN' => "England",
        'EE' => "Estonia",
        'FI' => "Finland",
        'FR' => "France",
        'DE' => "Germany",
        'GB' => "Great Britain",
        'GR' => "Greece",
        'HU' => "Hungary",
        'IE' => "Ireland",
        'IT' => "Italy",
        'LV' => "Latvia",
        'LT' => "Lithuania",
        'LU' => "Luxembourg",
        'MT' => "Malta",
        'NZ' => "New Zealand",
        'NL' => "Netherlands",
        'PL' => "Poland",
        'PT' => "Portugal",
        'RO' => "Romania",
        'SK' => "Slovakia",
        'SI' => "Slovenia",
        'ES' => "Spain",
        'SE' => "Sweden",
        'CH' => "Switzerland",
        'SC' => "Scotland",
        'UK' => "United Kingdom",
        'US' => "USA",
        'TK' => "Turkey",
        'NO' => "Norway",
        'MX' => "Mexico",
        'IL' => "Israel",
        'IN' => "India",
        'IS' => "Iceland",
        'CN' => "China",
        'JP' => "Japan",
        'VN' => "áVietnamí"
    };
   # Populate the original loop with "name" and "code"
    my @country_loop_orig;
    print $IN->header;
    foreach (keys %{$countries}) {
      push @country_loop_orig, {
        name => $countries->{$lang}->{$_},
        code => $_
      }
    }

   # sort it alphabetically
   my @country_loop = sort { lc($a->{name}) cmp lc($b->{name})  } @country_loop_orig;

This works fine with the English versions:

Australia
Austria
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
Vietnam

...but when you try and do it with utf8 such as íéó etc, it doesn't work:

Australia
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
áVietnam
íAustria

How do you achieve this? I found Sort::Naturally::XS, but couldn't get it to work.

标签: perl utf-8
1条回答
疯言疯语
2楼-- · 2019-07-29 06:22

The Unicode::Collate should help with this.

A simple example that sorts your last list

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt";
my @list = <$fh>;
chomp @list;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ascii characters may have a very particular accepted placement, and the question doesn't provide any details. Then perhaps Unicode::Collate::Locale may help.

See (study) this perl.com article and this post (T. Christiansen), and this effectiveperler article.


If data to be sorted by is in a complex data structure, cmp method is for individual comparison

my @sorted = map { $uc->cmp($a, $b) } @list;
查看更多
登录 后发表回答