Question:

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).

Everything works fine, except that I sometimes get strange encodings for some umlaut characters. For example, the Latin-1 e with diaeresis (0xEB) usually arrives as UTF-8 0xC3 0xAB, but sometimes also as 0xC3 0x83 0xC2 0xAB.

This has happened a number of times with input from different sources. Since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?

Answer 1:

$ "\xC3\x83\xC2\xAB"
Ã«
$ use Encode

$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë

You have double-encoded UTF-8. Encode::Repair is one way to deal with that.
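
The double encoding can be reproduced and undone in a few lines. What follows is a minimal Python sketch, not a description of how Encode::Repair works internally. The helper name `repair_double_utf8` is hypothetical, and the sketch assumes the intermediate step misread the UTF-8 bytes as Latin-1 (if Windows-1252 was involved instead, the mapping would differ).

```python
def repair_double_utf8(data: bytes) -> bytes:
    """Undo one round of UTF-8 -> (misread as Latin-1) -> UTF-8.

    Returns the input unchanged if it does not look double-encoded.
    """
    try:
        # Strip the outer UTF-8 layer, then map each resulting character
        # back to the single byte it was mistaken for.
        inner = data.decode("utf-8").encode("latin-1")
        # The recovered bytes must themselves be valid UTF-8.
        inner.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return inner


correct = "\u00eb".encode("utf-8")                    # C3 AB
doubled = correct.decode("latin-1").encode("utf-8")   # C3 83 C2 AB

print(doubled.hex(" "))                       # c3 83 c2 ab
print(repair_double_utf8(doubled).hex(" "))   # c3 ab
print(repair_double_utf8(correct).hex(" "))   # c3 ab (left alone)
```

This is a heuristic: a text that legitimately contains the two characters "Ã«" would be "repaired" as well, so it is safest to apply it only to sources known to double-encode.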

Answer 2:

Certain Unicode characters can be represented in a composed and a decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü (U+00FC) or by u followed by a combining diaeresis (U+0308), which a text renderer would then combine.

See the Wikipedia article on Unicode equivalence for gory details.

Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
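
As an illustration, here is a small Python sketch using the standard library's `unicodedata.normalize`. The variable names are only for this example.

```python
import unicodedata

composed = "\u00fc"      # ü as a single code point
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS

# The raw strings differ, both as code points and as UTF-8 bytes.
print(composed == decomposed)               # False
print(composed.encode("utf-8").hex(" "))    # c3 bc
print(decomposed.encode("utf-8").hex(" "))  # 75 cc 88

# After normalizing both sides to the same form they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Note that only the composed form has a Latin-1 equivalent, so normalizing to NFC before transcoding to ISO-8859-1 is the likely choice for the application in the question.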

Answer 3:

(I'm answering your subject question, "Can there be 2 different UTF-8 encodings for the same character?", which is significantly different from the question inside the post.)

("Character" usually means string element. It's ambiguous at best, and it's not the right word to use here. The Unicode term for what a reader perceives as a single character is "grapheme", or more precisely "grapheme cluster".)

Yes, more than one sequence of code points can result in the same grapheme. For example, both

U+00EB  LATIN SMALL LETTER E WITH DIAERESIS

and

U+0065  LATIN SMALL LETTER E
U+0308  COMBINING DIAERESIS

should display as "ë". Let's see how your browser does:

U+00EB: "ë"
U+0065,0308: "ë"

In UTF-8, these code points would be encoded as

U+00EB: C3 AB
U+0065: 65
U+0308: CC 88

One would use Unicode::Normalize's NFC or NFD to normalize a string to one of two formats (your choice).

$ perl -MUnicode::Normalize -E'
   $x = "\x{00EB}";
   $y = "\x{0065}\x{0308}";

   say     $x  eq     $y  ?1:0;
   say NFC($x) eq NFC($y) ?1:0;
   say NFD($x) eq NFD($y) ?1:0;
'
0
1
1

There are also so-called "overlong" encodings in UTF-8. (Specifically UTF-8, not Unicode in general.) In UTF-8, Unicode code points are encoded using one of the following four bit patterns, of one to four bytes:

1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The "x"s represent the bits of the code point to encode. One must use the shortest pattern that fits, so U+00EB would be

0000 0000 1110 1011
      --- ---- ----

   -----   ------
110xxxxx 10xxxxxx
11000011 10101011
C3       AB
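
The same bit shuffling can be written out in code. The following Python sketch is only illustrative (the function name `encode_utf8` is made up for this example; in practice one would call `str.encode("utf-8")`). It picks the shortest of the four patterns above for a given code point.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point with the shortest applicable UTF-8 pattern."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:       # up to 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_utf8(0x00EB).hex(" "))   # c3 ab
print(encode_utf8(0x0308).hex(" "))   # cc 88
```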

But someone clever might do

0000 0000 1110 1011
---- ---- ---- ----

    ----   ------   ------
1110xxxx 10xxxxxx 10xxxxxx
11100000 10000011 10101011
E0       83       AB

Applications should reject E0 83 AB (or at least convert it to C3 AB), but some don't, and that can cause security problems. Perl's Encode module treats that sequence as invalid, so it shouldn't be an issue for Perl.
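
To make the risk concrete, here is a short Python sketch. `lax_decode_3byte` is a deliberately naive, hypothetical decoder that strips the marker bits without checking for the shortest form. Python's built-in strict decoder, like Encode's, refuses the sequence.

```python
def lax_decode_3byte(b: bytes) -> int:
    """Naive three-byte decoder: strips marker bits, checks nothing else."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


overlong = bytes([0xE0, 0x83, 0xAB])

# The lax decoder happily yields U+00EB, the same code point as C3 AB.
cp = lax_decode_3byte(overlong)
print(hex(cp))        # 0xeb

# A correct decoder must notice that the value fits a shorter pattern.
print(cp < 0x800)     # True, so the three-byte form is overlong

# Python's strict UTF-8 decoder rejects the sequence outright.
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)       # True
```

The security issue arises when a filter compares raw bytes (for example, looking for "/" or "<") while a later, laxer stage decodes an overlong form of the same character.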