I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).
All works fine, except I sometimes get strange encodings for some umlaut characters. For example the Latin 1 E with 2 dots (0xEB) usually comes as UTF-8 0xC3 0xAB, but sometimes also as 0xC3 0x83 0xC2 0xAB.
This has happened a number of times, with input from different sources. Since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?
You have double-encoded UTF-8. Encode::Repair is one way to deal with that.
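To make the diagnosis concrete, here is a minimal Python sketch that reproduces the exact bytes from the question and undoes one layer of double encoding. It is not how Encode::Repair works internally; `repair_double_utf8` is a hypothetical helper name, and the approach is a heuristic that assumes the second, mistaken pass treated the UTF-8 bytes as Latin-1.

```python
# "ë" (U+00EB) correctly encoded as UTF-8.
correct = "\u00eb".encode("utf-8")

# Double encoding: the UTF-8 bytes are misread as Latin-1 text,
# then encoded as UTF-8 a second time.
doubled = correct.decode("latin-1").encode("utf-8")


def repair_double_utf8(data: bytes) -> bytes:
    """Undo one layer of double encoding when that is possible.

    Hypothetical helper: if reversing the Latin-1 step yields valid
    UTF-8, return the repaired bytes; otherwise return the input as-is.
    """
    try:
        candidate = data.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # must itself be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data
    return candidate


print(correct.hex(" "))                      # bytes of the single encoding
print(doubled.hex(" "))                      # bytes of the double encoding
print(repair_double_utf8(doubled).hex(" "))  # bytes after the repair
```

Because this is a heuristic, text that legitimately contains a sequence such as "Ã«" would be "repaired" by mistake, so it is probably best applied only to sources known to double-encode.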
(I'm answering your subject question, "Can there be 2 different UTF-8 encodings for the same character?", which is significantly different from the question inside the post.)
("Character" usually means string element. It's ambiguous at best, and it's not the right word to use here. The Unicode term for a visual representation, a glyph, is "grapheme".)
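Yes, more than one sequence of code points can result in the same grapheme. For example, both
U+00EB (LATIN SMALL LETTER E WITH DIAERESIS)
and
U+0065 U+0308 (LATIN SMALL LETTER E followed by COMBINING DIAERESIS)
should display as "ë". Let's see how your browser does:
ë (U+00EB) and ë (U+0065 U+0308)
In UTF-8, these code points would be encoded as
C3 AB
and
65 CC 88
respectively.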
One would use Unicode::Normalize's NFC or NFD to normalize a string to one of two formats (your choice).
There's also something called "overlong" encodings in UTF-8. (Specifically UTF-8, not Unicode in general.) In UTF-8, Unicode code points are encoded using one of the four following bit patterns:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The "x"s represent the bits of the code point to encode. One must use the shortest possible pattern, so U+00EB would be
11000011 10101011 (C3 AB)
But someone clever might do
11100000 10000011 10101011 (E0 83 AB)
Applications should reject E0 83 AB (or at least convert it to C3 AB), but some don't, and that can cause security problems. Perl's Encode module treats that sequence as invalid, so it shouldn't be an issue for Perl.
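The overlong case can be illustrated with a short Python sketch. It shows that blindly extracting the payload bits of E0 83 AB yields U+00EB, while a strict decoder (here Python's built-in one, standing in for any conforming library) refuses the sequence. The names `naive_decode_3byte` and `is_valid_utf8` are made up for this example.

```python
def naive_decode_3byte(b: bytes) -> int:
    """Extract the payload bits of a 3-byte pattern with no validity checks.

    This is what a careless decoder does; it accepts overlong forms.
    """
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = bytes([0xC3, 0xAB])        # the only valid encoding of U+00EB
overlong = bytes([0xE0, 0x83, 0xAB])  # same payload bits, padded to 3 bytes

print(hex(naive_decode_3byte(overlong)))  # code point a careless decoder sees
print(is_valid_utf8(shortest))            # strict decoder's verdict
print(is_valid_utf8(overlong))            # strict decoder's verdict
```

The security concern is that a filter which looks only for C3 AB (or, more dangerously, for an ASCII byte such as "/") can be bypassed if a later stage decodes the overlong form leniently.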
Certain Unicode characters can be represented in a composed and decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü or by u followed by ¨, which a text renderer would then combine. See the Wikipedia article on Unicode equivalence for gory details.
Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
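As a rough illustration of such normalization, the following Python sketch uses the standard `unicodedata` module (playing the role Unicode::Normalize has in Perl). The helper name `to_latin1` is an assumption made for this example. It composes the string with NFC before transcoding, because the combining diaeresis U+0308 has no Latin-1 equivalent on its own.

```python
import unicodedata

composed = "\u00eb"     # single code point: e with diaeresis
decomposed = "e\u0308"  # "e" followed by COMBINING DIAERESIS

# The two strings render alike but are different code point sequences.
print(composed == decomposed)

# Normalizing both to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)


def to_latin1(text: str) -> bytes:
    """Compose with NFC, then encode as ISO-8859-1 (hypothetical helper).

    Raises UnicodeEncodeError for characters that Latin-1 cannot represent.
    """
    return unicodedata.normalize("NFC", text).encode("latin-1")


print(to_latin1(decomposed).hex())
```

Without the NFC step, encoding the decomposed string as Latin-1 would fail, since only the precomposed U+00EB maps to the single byte 0xEB.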