I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code sorts the Cyrillic lines incorrectly.
sub sort_by_default {
my @sorted_lines = sort {
$a <=> $b
||
fc( $a) cmp fc($b)
} @_;
}
The cmp operator used with sort can't help with this; it has no notion of encodings or language rules and merely compares by codepoint, character by character, which gives surprising results in many languages. Use Unicode::Collate.† See this post for a bit more; for far more, see this post by tchrist and this perl.com article.
The other issue is reading (decoding) input and writing (encoding) output in UTF-8 correctly. One way to ensure that data on the standard streams is handled correctly is the open pragma, with which you can set "layers" so that input is decoded as it is read and output is encoded as it is written.
Altogether, an example:
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);
say for @sorted;
The module's cmp method can be used for individual comparisons (if the data is in a complex data structure and not just a flat list of lines, for instance):
my @sorted = sort { $uc->cmp($a, $b) } @data;
where the arguments passed to cmp should be the values to compare, extracted suitably from $a and $b (the elements of @data that sort is comparing).
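As a minimal sketch of that idea, the following sorts an array of hash references by one field. The records and the field names (name, id) are made up for illustration; only Unicode::Collate's new and cmp methods are taken from the module.

```perl
use strict;
use warnings;
use utf8;                      # the string literals below are UTF-8 in the source
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical records; only the 'name' field takes part in the comparison.
my @data = (
    { name => 'Борис',    id => 1 },
    { name => 'Peter',    id => 2 },
    { name => 'John',     id => 3 },
    { name => 'Владимир', id => 4 },
);

my $uc = Unicode::Collate->new;

# sort sets $a and $b to two elements of @data; pull the key out of each.
my @sorted = sort { $uc->cmp($a->{name}, $b->{name}) } @data;

print "$_->{name}\n" for @sorted;
```

With the default collation this should list the Latin names first and then the Cyrillic ones, each group in alphabetical order.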
If you have UTF-8 data right in the source you need use utf8, while if you receive UTF-8 via yet other channels (@ARGV included) you may need to manually Encode::decode those strings.
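A small sketch of the manual decoding, assuming the shell hands the arguments over as UTF-8 encoded bytes (that depends on the terminal and locale). The helper name decode_args is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: turn UTF-8 encoded byte strings into character strings.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

my @args = decode_args(@ARGV);

# After decoding, length counts characters rather than bytes.
printf "%s: %d character(s)\n", $_, length $_ for @args;
```

Without the decode step, a one-letter Cyrillic argument would be reported as two "characters", because Perl would be looking at its two UTF-8 bytes.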
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this EffectivePerler article on custom sorting.
† Example: by codepoint comparison ä > b, while the accepted order in German is ä < b
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
@s = qw(ä b);
say join " ", sort { $a cmp $b } @s; #--> b ä
say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).
To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир
The key to handling UTF-8 correctly in Perl is to make sure that Perl knows that a given source or destination of data is in UTF-8. How this is done depends on how the data gets in or out. If the UTF-8 is coming from an input file, open the file like this:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
at the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between the encoding names. utf8 or UTF8 will allow valid UTF-8 and also non-valid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about non-valid codepoints. UTF-8 will allow valid UTF-8 but will not allow non-valid codepoint combinations; it is a short name for utf-8-strict. You may also read the question How do I sanitize invalid UTF-8 in Perl?
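To see which encoding a given name resolves to, you can ask Encode directly. This is only a sketch: it shows the name resolution and that strict decoding with FB_CROAK dies on malformed input; it does not try to enumerate the cases where the lax utf8 encoding is more permissive.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# The two spellings resolve to two different encoding objects.
my $strict = find_encoding('UTF-8')->name;
my $lax    = find_encoding('utf8')->name;
print "UTF-8 resolves to $strict, utf8 resolves to $lax\n";

# With FB_CROAK, decoding dies on malformed input instead of silently
# substituting U+FFFD. decode may modify its argument when a check value
# is given, so work on a copy.
my $bytes = "\xFF";            # a byte that can never occur in UTF-8
my $copy  = $bytes;
my $ok    = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print $ok ? "decoded\n" : "rejected: $@";
```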
Finally, following @zdim's advice, you may use at the beginning of the script:
use open ':encoding(UTF-8)';
Other variants are described here. This sets the default encoding layer for all open calls that do not specify a layer explicitly.