Why do Perl string operations on Unicode character

Posted 2020-02-01 08:31

Perl:

$string =~ s/[áàâã]/a/gi; #This line always prepends an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

This regular expression should convert "été" into "ete". Instead, it is converting it to "aetae". In other words, it prepends an "a" to every matched element. Even "à" is converted to "aa".

If I change the first line to this:

$string =~ s/(á|à|â|ã)/a/gi;

the "a" is no longer prepended, but now an "e" is prepended to every matched element instead (giving "eetee").

Even though I found a suitable solution, why does it behave that way?

Edit 1:

I added "use utf8;", but it did not change the behavior (although it broke my output in JavaScript/AJAX).

Edit 2:

The input stream originates from an Ajax request performed by jQuery. The page that sends it is set to UTF-8.

I am using Perl v5.10 (perl -v returns "This is perl, v5.10.0 built for i586-linux-thread-multi").

7 answers
Lonely孤独者°
#2 · 2020-02-01 08:54

This could also be a Unicode normalisation problem. Certain systems (I'm looking at you, OS X) store accented Latin letters in a specific normalised form, typically decomposed into a base letter followed by a combining mark. That can break a regular expression in which you write the character literally instead of using a Unicode or hex escape.
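A minimal sketch of that mismatch, assuming the input has already been decoded into characters. The variable names are illustrative only. Normalising to NFC with the core Unicode::Normalize module before matching makes the precomposed character class apply to both spellings of "é":

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# First "é" is decomposed (e + U+0301 combining acute), second is precomposed (U+00E9).
my $mixed = "e\x{301}t\x{e9}";

# Character class of precomposed é è ê ë, written with escapes.
my $e_class = qr/[\x{e9}\x{e8}\x{ea}\x{eb}]/;

# Without normalisation only the precomposed "é" matches;
# the combining accent on the first letter is left behind.
(my $naive = $mixed) =~ s/$e_class/e/g;

# After NFC both spellings are the single code point U+00E9, so both match.
(my $normalized = NFC($mixed)) =~ s/$e_class/e/g;

print length($naive), "\n";   # 4 (e, U+0301, t, e)
print "$normalized\n";        # ete
```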

#3 · 2020-02-01 08:55

I'd say you shouldn't really use regular expressions here. The easiest way to achieve this (although this might be undesirable) would be to convert your input string into US ASCII. The appropriate conversion tables should know that e is the closest equivalent to é.

Another option would be to use Unicode and normalize your string into NFD. This will break up all accented letters into base letter + diacritic. Then you can just go through your string and remove all combining diacritical characters.
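The NFD approach could look roughly like this. It is a sketch that assumes the string has already been decoded into characters. strip_accents is a hypothetical helper name, and \p{Mn} (nonspacing marks) is used here as the set of combining diacritics to remove:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop every combining mark.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);     # "é" becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}+//g;     # remove the nonspacing (combining) marks
    return $decomposed;
}

print strip_accents("\x{e9}t\x{e9}"), "\n";   # ete
```

Unlike the hand-written character classes, this handles any accented letter that has a canonical decomposition, not only the ones listed in the question.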

一纸荒年 Trace。
#4 · 2020-02-01 08:58

My guess is that Perl doesn't know how to handle the accented characters. Your regular expressions themselves look fine. You might want to add:

use utf8;
太酷不给撩
#5 · 2020-02-01 09:04

The problem is very likely down to not having

use utf8;

(or its equivalent for whatever coding system you are using) in your program. The weird replacements you have there look like problems with bytewise rather than characterwise regular expression replacement.

#!/usr/local/bin/perl
use warnings;
use strict;
use utf8;                 # the source file is UTF-8, so literals are characters
binmode STDOUT, ":utf8";  # encode characters as UTF-8 on output
my $string = "été";

$string =~ s/[áàâã]/a/gi; # with "use utf8" each class matches whole characters
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

print "$string\n";

prints

ete
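To see why the byte-wise interpretation gives exactly the "aetae" from the question, here is a small sketch. It builds the byte strings explicitly with the core Encode module so that it does not depend on the encoding of the source file. The variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "été" as the five UTF-8 bytes a script receives from an undecoded stream.
my $bytes = encode('UTF-8', "\x{e9}t\x{e9}");

# Without "use utf8", a literal class such as [áàâã] in a UTF-8 source file is
# a set of single bytes: the lead byte 0xC3 plus each character's second byte.
my $class_a = encode('UTF-8', "\x{e1}\x{e0}\x{e2}\x{e3}");
my $class_e = encode('UTF-8', "\x{e9}\x{e8}\x{ea}\x{eb}");

(my $bytewise = $bytes) =~ s/[$class_a]/a/g;  # every 0xC3 lead byte becomes "a"
$bytewise =~ s/[$class_e]/e/g;                # every leftover 0xA9 becomes "e"

# Decoding first turns the bytes into characters, so the classes match whole letters.
my $charwise = decode('UTF-8', $bytes);
$charwise =~ s/[\x{e1}\x{e0}\x{e2}\x{e3}]/a/g;
$charwise =~ s/[\x{e9}\x{e8}\x{ea}\x{eb}]/e/g;

print "$bytewise\n";  # aetae
print "$charwise\n";  # ete
```

Every accented letter in these classes shares the lead byte 0xC3, which is why the first substitution turns each of them into an "a" followed by a stray second byte.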

If you are reading input from a file or from standard input, make sure you have the stream set to utf8 or whatever is appropriate for the encoding. For STDIN use

binmode STDIN, ":utf8";

If you are reading from a file, use

open my $file, "<:utf8", "file_name" or die "Cannot open file_name: $!";

to get the encoding right. If it is not in UTF-8, use :encoding(name) instead of :utf8.
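As a self-contained illustration of the decoding layer, the sketch below reads from an in-memory filehandle instead of a real file. That is an assumption made only so the example runs anywhere. The same layer string applies to a real file name or to binmode on STDIN:

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "été\n", standing in for the contents of a file.
my $raw = "\xC3\xA9t\xC3\xA9\n";

# The :encoding(UTF-8) layer decodes bytes into characters as they are read.
open my $in, '<:encoding(UTF-8)', \$raw
    or die "Cannot open in-memory file: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count = length $line;            # 3 characters, not 5 bytes

$line =~ s/[\x{e9}\x{e8}\x{ea}\x{eb}]/e/g;
print "$char_count $line\n";              # 3 ete
```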

做个烂人
#6 · 2020-02-01 09:11

This is probably because your strings are UTF-8 encoded but Perl is treating them as raw bytes, or something similar.

Instead of using something like [áàâã] you should probably use something like [\xE0-\xE3], which covers à, á, â and ã by code point. This only matches once the string has been decoded into characters.

You should probably also add use utf8; to your code.
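A short sketch of the escape-based class, assuming the input arrives as UTF-8 bytes and is decoded with the core Encode module first. Because the pattern contains only ASCII, it behaves the same whatever encoding the source file is saved in:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "à la carte", decoded into a character string.
my $string = decode('UTF-8', "\xC3\xA0 la carte");

# U+00E0..U+00E3 are à, á, â and ã.
$string =~ s/[\xE0-\xE3]/a/g;

print "$string\n";   # a la carte
```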

欢心
#7 · 2020-02-01 09:20

But did you really want to use regexes at all? Perhaps something like Text::Unidecode would be better:

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
ete