How to open STDIN/STDOUT handles and work with utf8

Posted 2020-04-10 03:03

Question:

My source code contains UTF-8 characters, so I do:

use utf8;

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
print $line; # Wide character in print at ...

Then I thought that my STDOUT should be in utf8:

use utf8;
use open IO => ':utf8 :std';

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
print $line; # Wide character in print at ...

Why do I get this warning when I tell Perl to use utf8 and my source code contains UTF-8 characters?

At the same time:

No error:

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
print $line;

No error:

use open IO => ':utf8 :std';

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
print $line;

How should I open my filehandles and work correctly with utf8?

UPD
Actually, I have this code. It does not match:

use open IO => ':utf8 :std';

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
my @match =  $line =~ m/(вiд|от|від)/i;
print "$line -> $1 \n";

Unfortunately, the regex does not match. The output is:

ЗГ. РАХ. №382 ВIД 03.02.2020Р ->

Then I add the utf8 pragma:

use utf8;
use open IO => ':utf8 :std';

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
my @match =  $line =~ m/(вiд|от|від)/i;
print "$line -> $1 \n";

Now the regex matches, but a warning is issued:

Wide character in print at t2.pl line 17.
ЗГ. РАХ. №382 ВIД 03.02.2020Р -> ВIД

Answer 1:

Thanks to @Grinnz on IRC.

The following code works:

use utf8;
use open ':encoding(UTF-8)', ':std';

my $line =  'ЗГ. РАХ. №382 ВIД 03.02.2020Р';
my @match =  $line =~ m/(вiд|от|від)/i;
print "$line -> $1 \n";
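
A possible way to see why both pragmas are needed, as a rough sketch rather than a definitive explanation: without use utf8 the literals are byte strings, so /i cannot fold the Cyrillic letters, while with use utf8 they are character strings that still have to be encoded on output. The variable names ($chars, $bytes and so on) are illustrative and not part of the original code.

```perl
use strict;
use warnings;
use utf8;                 # literals below are decoded into characters
use Encode qw(encode);    # core module, used here to simulate byte strings

my $chars = 'від';
my $bytes = encode('UTF-8', $chars);    # roughly what the literal is without `use utf8`

# Only numbers are printed, so no output layer is needed for this demo.
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

my $upper       = uc $chars;                  # upper-cased as characters
my $upper_bytes = encode('UTF-8', $upper);    # the same text as UTF-8 bytes

# Case-insensitive matching works on characters, but not on the raw bytes.
my $chars_match = $upper       =~ /\Q$chars\E/i ? 1 : 0;
my $bytes_match = $upper_bytes =~ /\Q$bytes\E/i ? 1 : 0;
printf "character match: %d, byte match: %d\n", $chars_match, $bytes_match;
```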

Notes: @Grinnz advised using https://metacpan.org/pod/open::layers. With the open pragma, :std is not a layer, so it must be its own argument in the list.

Also, I should not use :utf8, because:
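
If neither pragma form is wanted, a similar effect can probably be had with core binmode calls on the standard handles. This is only a sketch under that assumption: it affects just the three standard handles, not files opened later, and the $word variable is introduced here for illustration.

```perl
use strict;
use warnings;
use utf8;    # the source file is UTF-8, so literals become character strings

# Decode input and encode output on the standard handles explicitly.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

my $line = 'ЗГ. РАХ. №382 ВIД 03.02.2020Р';

# Capture into a lexical instead of relying on $1 after a possibly failed match.
my ($word) = $line =~ m/(вiд|от|від)/i;
print "$line -> $word\n" if defined $word;
```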

CAUTION: Do not use this layer to translate from UTF-8 bytes, as invalid UTF-8 or binary data will result in malformed Perl strings. It is unlikely to produce invalid UTF-8 when used for output, though it will instead produce UTF-EBCDIC on EBCDIC systems. The :encoding(UTF-8) layer (hyphen is significant) is preferred as it will ensure translation between valid UTF-8 bytes and valid Unicode characters.
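
To illustrate what the :encoding(UTF-8) layer does, here is a small round-trip sketch using in-memory filehandles. It is an illustration only: the $buffer, $out, $in and $roundtrip names are made up for the example, and it covers valid input, not the invalid-data case the caution describes.

```perl
use strict;
use warnings;
use utf8;    # the literal below is a character string

my $text = 'ЗГ. РАХ. №382';

# Write characters through the layer; the in-memory buffer receives UTF-8 bytes.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

# Read the bytes back through the same layer to get characters again.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read failed: $!";
my $roundtrip = <$in>;
close $in;

# Only ASCII is printed here, so STDOUT needs no extra layer for this demo.
printf "characters: %d, bytes in buffer: %d\n", length($text), length($buffer);
print "round trip preserved the text: ", ( $roundtrip eq $text ? 'yes' : 'no' ), "\n";
```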