I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?
Recently I came across files with a severe mix of encodings: UTF-8, CP1252, and text that had been UTF-8 encoded, then interpreted as CP1252, then encoded as UTF-8 again, then interpreted as CP1252 again, and so forth.
I wrote the code below, which worked well for me. It looks for typical UTF-8 byte sequences, even when some of the bytes are not UTF-8 but rather the Unicode representation of the equivalent CP1252 byte.
This has similar limitations as ikegami's answer, except that the same limitations are also applicable to UTF-8 encoded strings.
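The code referred to above is not reproduced here. As an illustration only (this is not the author's code, and unlike the described approach it operates on the whole string rather than on individual byte sequences), here is a minimal sketch of peeling off repeated "decoded as CP1252, re-encoded as UTF-8" layers with core Encode; the helper name undo_double_utf8 is made up:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not the author's original code): repeatedly undo
# "bytes decoded as CP1252, then re-encoded as UTF-8" layers until the
# text no longer looks double-encoded.
sub undo_double_utf8 {
    my ($text) = @_;
    while (1) {
        # Map the characters back to CP1252 bytes; if that fails, or if
        # the resulting bytes are not valid UTF-8, there is no further
        # mojibake layer to remove.
        my $bytes = eval {
            encode('cp1252', $text, Encode::FB_CROAK | Encode::LEAVE_SRC)
        };
        last unless defined $bytes;
        my $decoded = eval {
            decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC)
        };
        last unless defined $decoded;
        last if $decoded eq $text;    # plain ASCII: nothing left to undo
        $text = $decoded;
    }
    return $text;
}

# "Ã©" (U+00C3 U+00A9) is "é" after one round of mojibake.
my $fixed = undo_double_utf8("caf\x{C3}\x{A9}");   # "café"
```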
Yes!
Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.
A line can contain a mix of encodings
Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

Heuristics are employed, but they are fairly reliable. Only the following cases will fail:
- One of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] encoded using iso-8859-1 or cp1252, followed by one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] encoded using iso-8859-1 or cp1252.
- One of [àáâãäåæçèéêëìíîï] encoded using iso-8859-1 or cp1252, followed by two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] encoded using iso-8859-1 or cp1252.
- One of [ðñòóôõö÷] encoded using iso-8859-1 or cp1252, followed by three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] encoded using iso-8859-1 or cp1252.
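Outside those edge cases, fix_latin simply works on a mixed byte string. A short sketch (the sample bytes are made up for illustration):

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# One byte string containing "café" twice: first as UTF-8 (C3 A9),
# then as CP1252 (E9).
my $bytes = "caf\xC3\xA9 caf\xE9";

# fix_latin returns a decoded Perl character string; both words
# come out as "café".
my $text = fix_latin($bytes);
```

The function also accepts options; for example, passing `bytes_only => 1` returns UTF-8 encoded bytes instead of a character string.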
The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.
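A sketch of that Encode-based equivalent: decode as UTF-8, and for any byte that is not part of a valid UTF-8 sequence, let a coderef fallback decode that single byte as cp1252 instead (the wrapper name decode_mixed is my own):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode mostly-UTF-8 bytes; each ill-formed byte is passed to the
# coderef as an ordinal value and decoded as CP1252 instead.
sub decode_mixed {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes,
        sub { decode('cp1252', chr($_[0])) });
}

my $text = decode_mixed("caf\xC3\xA9 caf\xE9");   # "café café"
```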
Each line only uses one encoding
fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether the line is valid UTF-8.

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:
- The line is encoded using iso-8859-1 or cp1252,
- At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present in the line,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present in the line, and
- None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present in the line except where previously mentioned.
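This per-line approach can be sketched with core Encode alone: attempt a strict UTF-8 decode, and fall back to cp1252 for lines that are not valid UTF-8 (decode_line is an illustrative name):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Treat a line as UTF-8 if it decodes cleanly, otherwise as CP1252.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    return defined $text ? $text : decode('cp1252', $bytes);
}

my $utf8_line   = decode_line("caf\xC3\xA9");   # valid UTF-8  -> "café"
my $cp1252_line = decode_line("caf\xE9");       # invalid UTF-8 -> "café"
```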
Notes:
- Encoding::FixLatin installs a command-line tool named fix_latin for converting files, and it would be trivial to write one using the second approach.
- fix_latin (both the function and the tool) can be sped up by installing Encoding::FixLatin::XS.

This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8 this is trivial using the fallback option in Unicode::UTF8::decode_utf8().
Unicode::UTF8 is written in C/XS and only invokes the callback/fallback when it encounters an ill-formed UTF-8 sequence.
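A sketch of that fallback usage, under the assumption that the fallback coderef receives the ill-formed octet sequence as its first argument (the sample bytes are made up for illustration):

```perl
use strict;
use warnings;
use Unicode::UTF8 qw(decode_utf8);
use Encode ();

# CP1252 "é" (E9) mixed into otherwise valid UTF-8 ("é" as C3 A9).
my $octets = "caf\xE9 caf\xC3\xA9";

# The fallback is only invoked for ill-formed sequences; here it
# reinterprets those octets as CP1252.
my $string = decode_utf8($octets,
    sub { Encode::decode('cp1252', $_[0]) });
```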