I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:
- use Unicode::Collate with $Collator = Unicode::Collate->new();
- use Unicode::Collate::Locale with $Collator = Unicode::Collate->new(locale => $ENV{'LANG'});
- use locale
Each one of them failed with the following errors (from the Perl side):
- Input is not sorted: [----,] came after [($1]
- Input is not sorted: [...] came after [&]
- Input is not sorted: [($1] came after [1]
The only method that worked for me involved setting LC_ALL=C for sort and using 8-bit characters in Perl. However, this way Unicode strings are not properly ordered.
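For illustration, the byte-wise approach could look roughly like the sketch below. It assumes the lines are kept as raw, undecoded bytes and that no use locale is in scope, so cmp compares byte values, which is the ordering sort(1) applies under LC_ALL=C. The helper names bytewise_cmp and first_unsorted_pair are made up for this example.

```perl
use strict;
use warnings;

# Compare two raw (undecoded) byte strings by byte value. With no
# "use locale" in scope, cmp on byte strings is a plain byte comparison,
# which is the ordering sort(1) uses when LC_ALL=C.
sub bytewise_cmp {
    my ($left, $right) = @_;
    return $left cmp $right;
}

# Return the first adjacent pair that is out of order, or an empty list
# if the lines are already sorted. (Hypothetical helper for this sketch.)
sub first_unsorted_pair {
    my @lines = @_;
    for my $i (1 .. $#lines) {
        my ($previous, $current) = @lines[$i - 1, $i];
        return ($previous, $current) if bytewise_cmp($previous, $current) > 0;
    }
    return;
}

# The strings from the error messages above, sorted by byte value.
my @sorted = sort { bytewise_cmp($a, $b) } ('1', '($1', '...', '&', '----,');
print "$_\n" for @sorted;
```

By byte value these come out as &, ($1, ----, then ..., then 1, so the three errors above would not be reported for input sorted this way. The cost is the one already noted: non-ASCII text ends up in code point order rather than in any language's alphabetical order.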
Using Unicode::Collate or Unicode::Collate::Locale makes no sense. You're not trying to sort based on Unicode definitions, you're trying to sort based on your locale. That's what use locale; is for. I don't know why cmp under use locale; didn't give you the desired order.
You could process the decompressed files.
It'll require more temporary storage, of course, but you'll get exactly the order you want.
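For reference, a locale-aware sort under use locale; might be set up along these lines. This is only a sketch: locale_sort is a made-up helper name, the locale must be installed on the system, and how well Perl collates under UTF-8 locales has varied between Perl versions (see perllocale), so the result may still differ from what the sort utility produces.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Sort strings using the collation rules of the named locale.
# Returns an array reference, or nothing if the locale cannot be selected
# (for example because it is not installed).
sub locale_sort {
    my ($locale_name, @strings) = @_;

    # Remember the current collation locale so it can be restored.
    my $previous = setlocale(LC_COLLATE);
    defined setlocale(LC_COLLATE, $locale_name) or return;

    # Within this scope, cmp and sort follow the LC_COLLATE locale.
    use locale;
    my @sorted = sort { $a cmp $b } @strings;

    setlocale(LC_COLLATE, $previous) if defined $previous;
    return \@sorted;
}
```

A call such as locale_sort('en_US.UTF-8', @lines) would then return the lines in Perl's idea of that locale's order, which can be compared against the output of the sort utility.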
I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.
Truth be told, it's the sort utility that's weird.
In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.
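A minimal sketch of delegating to the sort utility through IPC::Run3 follows. The function name sort_with_utility is hypothetical, and the sketch assumes IPC::Run3 is installed from CPAN, that a sort executable is on PATH, and that the lines are UTF-8 encoded byte strings without embedded newlines.

```perl
use strict;
use warnings;
use IPC::Run3 qw(run3);

# Sort lines by piping them through the external sort utility under the
# given locale, so the order is whatever sort(1) itself produces.
sub sort_with_utility {
    my ($locale_name, @lines) = @_;

    # The child process inherits %ENV, so this sets the locale for sort only.
    local $ENV{LC_ALL} = $locale_name;

    my $input = join '', map { "$_\n" } @lines;
    my ($output, $errors) = ('', '');
    run3(['sort'], \$input, \$output, \$errors);
    die "sort exited with status " . ($? >> 8) . ": $errors" if $? != 0;

    # Split on newlines and drop the empty field after the final newline.
    my @sorted = split /\n/, $output, -1;
    pop @sorted if @sorted && $sorted[-1] eq '';
    return @sorted;
}
```

Called as sort_with_utility('en_US.UTF-8', @lines), this hands the ordering decision to the same program that sorted the files, which sidesteps differences between Perl's collation and the C library's, at the cost of spawning a process.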