Can there be 2 different UTF-8 encodings for the same character?

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).

All works fine, except I sometimes get strange encodings for some umlaut characters. For example, the Latin 1 e with diaeresis, ë (0xEB), usually arrives as UTF-8 0xC3 0xAB, but sometimes also as 0xC3 0x83 0xC2 0xAB.

This has happened a number of times with different sources. Noting that the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?

Tags: perl utf-8 character-encoding

3 answers

beautiful°

#2 · 2019-04-10 03:51

$ "\xC3\x83\xC2\xAB"
Ã«
$ use Encode

$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë

You have double-encoded UTF-8. Encode::Repair is one way to deal with that.
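
To illustrate what "double-encoded" means here, the following is a minimal sketch that uses only the core Encode module rather than Encode::Repair. The helper name undo_double_utf8 is hypothetical, and the sketch assumes the extra encoding step treated the UTF-8 bytes as ISO-8859-1; if the culprit was Windows-1252 instead, some characters would not be repaired this way.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo one round of "UTF-8 bytes misread as
# ISO-8859-1 and encoded to UTF-8 again". Returns the input unchanged
# when it does not look double-encoded.
sub undo_double_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $fixed = eval {
        my $once  = decode('UTF-8', $bytes, $check);       # outer layer
        my $inner = encode('ISO-8859-1', $once, $check);   # chars back to bytes
        my $twice = decode('UTF-8', $inner, $check);       # inner layer
        encode('UTF-8', $twice);
    };
    return defined $fixed ? $fixed : $bytes;
}

# Usage: print the repaired bytes in hex.
my $repaired = undo_double_utf8("\xC3\x83\xC2\xAB");
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $repaired), "\n";  # C3 AB
```

Correctly encoded input such as C3 AB falls through unchanged, because the single byte EB left after the first pass is not valid UTF-8 and the eval fails. Text that legitimately contains "Ã«" cannot be told apart from double-encoded "ë", so a repair like this is only safe when you know the source is affected.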


霸刀☆藐视天下

#3 · 2019-04-10 04:09

(I'm answering your subject question, "Can there be 2 different UTF-8 encodings for the same character?", which is significantly different from the question inside the post.)

("Character" usually means string element. It's ambiguous at best, and it's not the right word to use here. The Unicode term for a visual representation, a glyph, is "grapheme".)

Yes, more than one sequence of code points can result in the same grapheme. For example, both

U+00EB  LATIN SMALL LETTER E WITH DIAERESIS

and

U+0065  LATIN SMALL LETTER E
U+0308  COMBINING DIAERESIS

should display as "ë". Let's see how your browser does:

U+00EB: "ë"
U+0065,0308: "ë"

In UTF-8, these code points would be encoded as

U+00EB: C3 AB
U+0065: 65
U+0308: CC 88
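
These byte sequences can be checked with the core Encode module. This is a small sketch; the helper name utf8_hex is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a character string and return
# the resulting bytes as space-separated uppercase hex.
sub utf8_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $str);
}

print utf8_hex("\x{00EB}"), "\n";          # C3 AB
print utf8_hex("\x{0065}\x{0308}"), "\n";  # 65 CC 88
```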

One would use Unicode::Normalize's NFC or NFD to normalize a string to one of two formats (your choice).

$ perl -MUnicode::Normalize -E'
   $x = "\x{00EB}";
   $y = "\x{0065}\x{0308}";

   say     $x  eq     $y  ?1:0;
   say NFC($x) eq NFC($y) ?1:0;
   say NFD($x) eq NFD($y) ?1:0;
'
0
1
1

There's also something called "overlong" encodings in UTF-8. (Specifically UTF-8, not Unicode in general.) In UTF-8, Unicode code points are encoded using one of the four following bit patterns:

1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The "x"s represent the bits of the code point to encode. One must use the shortest possible pattern, so U+00EB would be

0000 0000 1110 1011
      --- ---- ----

   -----   ------
110xxxxx 10xxxxxx
11000011 10101011
C3       AB

But someone clever might do

0000 0000 1110 1011
---- ---- ---- ----

    ----   ------   ------
1110xxxx 10xxxxxx 10xxxxxx
11100000 10000011 10101011
E0       83       AB

Applications should reject E0 83 AB (or at least convert it to C3 AB), but some don't, and that can cause security problems. Perl's Encode module treats that sequence as invalid, so it shouldn't be an issue for Perl.
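
The two bit layouts above can be reproduced by hand, and the result fed to Encode's strict 'UTF-8' decoder to see which one it accepts. This is only a sketch: utf8_2byte, utf8_3byte and is_strict_utf8 are hypothetical helpers written for this illustration, and the packing functions do no range checking.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Pack a code point into the 2-byte pattern 110xxxxx 10xxxxxx.
sub utf8_2byte {
    my ($cp) = @_;
    return pack 'C2', 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F);
}

# Pack a code point into the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
# For code points below U+0800 this yields an overlong sequence.
sub utf8_3byte {
    my ($cp) = @_;
    return pack 'C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F);
}

# True if the strict decoder accepts the byte string.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "%s accepted: %d\n", unpack('H*', utf8_2byte(0xEB)), is_strict_utf8(utf8_2byte(0xEB));
printf "%s accepted: %d\n", unpack('H*', utf8_3byte(0xEB)), is_strict_utf8(utf8_3byte(0xEB));
```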


够拽才男人

#4 · 2019-04-10 04:11

Certain Unicode characters can be represented in either a composed or a decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü or by u followed by a combining diaeresis (¨), which a text renderer would then combine.

See the Wikipedia article on Unicode equivalence for gory details.

Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
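
For the UTF-8 to Latin 1 task in the question, normalization matters because ISO-8859-1 has no combining diaeresis: a decomposed "u + U+0308" only maps to the single byte FC after it has been composed. A minimal sketch using the core modules Encode and Unicode::Normalize follows; the function name utf8_to_latin1 is an assumption for illustration, and with the default settings any character that Latin 1 cannot represent is replaced by a substitution character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: decode UTF-8 bytes, compose with NFC,
# then encode the result as ISO-8859-1 bytes.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    return encode('ISO-8859-1', NFC($chars));
}

# Both spellings of "ü" end up as the same Latin 1 byte.
printf "%s\n", unpack 'H*', utf8_to_latin1("\xC3\xBC");   # fc
printf "%s\n", unpack 'H*', utf8_to_latin1("u\xCC\x88");  # fc
```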
