How can Perl and Unix sort agree on how to order Unicode strings?

Posted 2019-04-17 23:56

Question:

I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

  • use Unicode::Collate with $Collator = Unicode::Collate->new();
  • use Unicode::Collate::Locale with $Collator = Unicode::Collate->new(locale => $ENV{'LANG'});
  • use locale

Each one of them failed with the following errors (from the Perl side):

  • Input is not sorted: [----,] came after [($1]
  • Input is not sorted: [...] came after [&]
  • Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort and using 8-bit characters in Perl. However, Unicode strings are not ordered properly that way.

Answer 1:

Using Unicode::Collate or Unicode::Collate::Locale makes no sense here. You're not trying to sort based on Unicode definitions; you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

You could expand the counted files back into their original lines and let sort do the ordering:

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.


I found a case where use locale; didn't make Perl's sort/cmp give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.
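
For comparison, the byte-order workaround mentioned in the question does make the two agree on this sample. The following is only a minimal sketch, assuming the same three lines as in the data file above; the temporary file and the variable names c_sorted and perl_sorted are made up for illustration.

```shell
# Recreate the three-line sample in a temporary file (hypothetical name).
data=$(mktemp)
printf '%s\n' '(' '($1' '1' > "$data"

# sort(1) in the C locale compares lines byte by byte.
c_sorted=$(LC_ALL=C sort "$data")

# Perl's sort without "use locale" compares with plain cmp.
# chomp first so the trailing newline does not take part in the comparison.
perl_sorted=$(perl -e 'chomp(@l = <>); print "$_\n" for sort @l' "$data")

[ "$c_sorted" = "$perl_sorted" ] && echo "same order" || echo "different order"
rm -f "$data"
```

This only shows that the two agree on plain byte order; it does nothing for locale-aware ordering of non-ASCII strings.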


In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.
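
If the Perl side only needs to verify that its input is already sorted (which the "Input is not sorted" errors suggest, though that is an assumption), one way to stay consistent with sort is to let sort do the check with its -c option instead of reimplementing the comparison. This is a rough sketch in plain shell rather than IPC::Run3; the function name is_sorted and the sample files are hypothetical.

```shell
# is_sorted FILE: succeed if sort(1) itself considers FILE sorted
# under whatever collation locale is currently in effect.
is_sorted() {
    sort -c "$1" 2>/dev/null
}

# Hypothetical sample files, for illustration only.
good=$(mktemp)
bad=$(mktemp)
printf '%s\n' 'apple' 'banana' 'cherry' > "$good"
printf '%s\n' 'banana' 'apple' > "$bad"

is_sorted "$good" && echo "good: sorted"
is_sorted "$bad"  || echo "bad: not sorted"
```

Because the check is made by the same program that produced the order, it sidesteps the question of how characters with undefined weights compare, at least for a single sort version and locale.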