I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).
All works fine, except I sometimes get strange encodings for some umlaut characters. For example the Latin 1 E with 2 dots (0xEB) usually comes as UTF-8 0xC3 0xAB, but sometimes also as 0xC3 0x83 0xC2 0xAB.
This has happened a number of times with input from different sources. Since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?
$ "\xC3\x83\xC2\xAB"
ë
$ use Encode
$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë
You have double-encoded UTF-8. Encode::Repair is one way to deal with that.
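To make the diagnosis concrete, here is a minimal sketch using only the core Encode module. It assumes the second encoding step misread the UTF-8 bytes as ISO-8859-1 (Windows-1252 would produce the same bytes for this particular character). The helper name `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# U+00EB correctly encoded once.
my $once = encode('UTF-8', "\x{EB}");

# Simulate the bug: the UTF-8 bytes are mistaken for Latin-1 text
# and encoded to UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Undo it: decode once, turn the resulting characters back into
# bytes via Latin-1, then decode those bytes as UTF-8 again.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));

printf "%s\n", hex_bytes($once);      # C3 AB
printf "%s\n", hex_bytes($twice);     # C3 83 C2 AB
printf "U+%04X\n", ord($repaired);    # U+00EB
```

Only apply the extra decode step to input you know is double-encoded. Running it on correctly encoded text would corrupt it, which is why a dedicated module such as Encode::Repair is worth a look.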
Certain Unicode characters can be represented in a composed and decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü or by u followed by ¨, which a text renderer would then combine.
See the Wikipedia article on Unicode equivalence for gory details.
Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
(I'm answering your subject question, "Can there be 2 different UTF-8 encodings for the same character?", which is significantly different from the question inside the post.)
("Character" usually means string element. It's ambiguous at best, and it's not the right word to use here. The Unicode term for a visual representation, a glyph, is "grapheme".)
Yes, more than one sequence of code points can result in the same grapheme. For example, both
U+00EB LATIN SMALL LETTER E WITH DIAERESIS
and
U+0065 LATIN SMALL LETTER E
U+0308 COMBINING DIAERESIS
should display as "ë". Let's see how your browser does:
- U+00EB: "ë"
- U+0065,0308: "ë"
In UTF-8, these code points would be encoded as
- U+00EB: C3 AB
- U+0065: 65
- U+0308: CC 88
One would use Unicode::Normalize's NFC or NFD to normalize a string to one of two formats (your choice).
$ perl -MUnicode::Normalize -E'
$x = "\x{00EB}";
$y = "\x{0065}\x{0308}";
say $x eq $y ?1:0;
say NFC($x) eq NFC($y) ?1:0;
say NFD($x) eq NFD($y) ?1:0;
'
0
1
1
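For the transcoding task in the question, this matters because ISO-8859-1 has no combining diaeresis: a decomposed "ë" can only become 0xEB after it has been composed. A minimal sketch, assuming you want precomposed output and relying on Encode's default behaviour of substituting "?" for characters the target encoding cannot represent:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC);

my $decomposed = "\x{0065}\x{0308}";   # e + COMBINING DIAERESIS

# Without normalization, U+0308 has no Latin-1 equivalent, so
# encode() falls back to its substitution character.
my $lossy = encode('ISO-8859-1', $decomposed);

# Composing first yields U+00EB, which Latin-1 can represent.
my $clean = encode('ISO-8859-1', NFC($decomposed));

print unpack('H*', $lossy), "\n";   # 653f  ("e?")
print unpack('H*', $clean), "\n";   # eb
```

So when converting from UTF-8 to Latin-1, it is probably worth running NFC on the decoded string before encoding it.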
There's also something called "overlong" encodings in UTF-8. (Specifically UTF-8, not Unicode in general.) In UTF-8, Unicode code points are encoded using one of the four following bit patterns:
1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The "x"s represent the bits of the code point to encode. One must use the shortest pattern that fits, so U+00EB would be
0000 0000 1110 1011
      --- ---- ----
       ----- ------
  110xxxxx 10xxxxxx
  11000011 10101011
  C3       AB
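The bit-packing above can be sketched in a few lines of Perl. `utf8_bytes` is a hypothetical helper written purely for illustration (it does not check for surrogates or for code points above U+10FFFF); real code should simply call Encode's `encode`, which is used here only to cross-check the result.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the shortest UTF-8 byte sequence
# (as a list of integers) for a code point.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                      # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                       # 110xxxxx
            0x80 | ($cp & 0x3F)) if $cp < 0x800;     # 10xxxxxx
    return (0xE0 | ($cp >> 12),                      # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                      # 11110xxx
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x65, 0xEB, 0x308) {
    my $manual = join ' ', map { sprintf('%02X', $_) } utf8_bytes($cp);
    my $encode = join ' ', map { sprintf('%02X', ord($_)) }
                           split(//, encode('UTF-8', chr($cp)));
    printf "U+%04X: %s (Encode: %s)\n", $cp, $manual, $encode;
}
# U+0065: 65 (Encode: 65)
# U+00EB: C3 AB (Encode: C3 AB)
# U+0308: CC 88 (Encode: CC 88)
```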
But someone clever might do
       0000 0000 1110 1011
       ---- ---- ---- ----
        ---- ------ ------
1110xxxx 10xxxxxx 10xxxxxx
11100000 10000011 10101011
E0       83       AB
Applications should reject E0 83 AB (or at least convert it to C3 AB), but some don't, and that can cause security problems. Perl's Encode module treats that sequence as invalid, so it shouldn't be an issue for Perl.
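A quick way to check this on your own Perl, as a sketch: `is_strict_utf8` is a hypothetical helper that asks Encode to croak on malformed input instead of quietly substituting U+FFFD, which is what `decode` does when no CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the bytes decode as strict UTF-8, else 0.
sub is_strict_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # it from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

printf "C3 AB:    %d\n", is_strict_utf8("\xC3\xAB");       # shortest form
printf "E0 83 AB: %d\n", is_strict_utf8("\xE0\x83\xAB");   # overlong form
```

With an Encode that behaves as described above, the first line should report 1 and the second 0.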