Recovering UTF-8 from broken ISO-Latin-1 sequence

Posted 2019-09-04 08:47

I have recently been encountering several broken UTF-8 strings that were converted to what I believe is ISO-Latin-1, and I was wondering whether a tool already exists to convert them back automatically, since no information is actually destroyed: no bits are lost.

Essentially, such a tool would take a sequence of characters and display what they would have been if those same bits had been interpreted as UTF-8 or some other encoding. Does such a tool exist? (I know it would be easy to build something myself, or even to do it manually, so I will probably do that if there really isn't anything.)

To clarify: in the particular case I am having, a forum's text editor accepts UTF-8 characters, but the forum itself then displays the characters that correspond to the individual bytes of each UTF-8 character.

For characters U+0000 to U+007F it is the exact same character, but:

  • Characters U+0080 to U+07FF are instead displayed as one character between U+00C2 and U+00DF followed by one character between U+0080 and U+00BF
  • Characters U+0800 to U+FFFF are instead displayed as one character between U+00E0 and U+00EF followed by two characters between U+0080 and U+00BF

and so on...
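This byte layout can be checked directly. A quick sketch in Python (the sample characters are my own, chosen only to illustrate each range):

```python
# "é" (U+00E9) falls in U+0080–U+07FF: two bytes, lead byte in 0xC2–0xDF,
# continuation byte in 0x80–0xBF.
print(["%#04x" % b for b in "é".encode("utf-8")])  # -> ['0xc3', '0xa9']

# "世" (U+4E16) falls in U+0800–U+FFFF: three bytes, lead byte in
# 0xE0–0xEF, two continuation bytes in 0x80–0xBF.
print(["%#04x" % b for b in "世".encode("utf-8")])  # -> ['0xe4', '0xb8', '0x96']
```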

So "�" should actually be displayed as the character U+2xy6 (x is the middle 4 bits of "�", y is the last 2 bits of "�" plus "10").

Although I still can't figure out exactly which of the characters between U+0080 and U+00BF '�' is.

What I am trying to do is take the ISO-Latin-1 byte value of each character in the string, concatenate them all together, and interpret the resulting byte sequence as UTF-8.
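That procedure can be sketched in a couple of lines of Python (the function name is illustrative; it assumes every mojibake character fits in Latin-1):

```python
def fix_mojibake(s: str) -> str:
    """Re-encode each character as its Latin-1 byte value, then
    decode the resulting byte sequence as UTF-8."""
    return s.encode("latin-1").decode("utf-8")

# "é" (U+00E9) is the UTF-8 byte pair 0xC3 0xA9, which Latin-1
# displays as the two characters "Ã©".
print(fix_mojibake("HÃ©llo"))  # -> Héllo
```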

2 Answers
劳资没心,怎么记你
Answer 2 · 2019-09-04 09:25

UTF-8 → Latin-1 is lossy, unfortunately. But UTF-8 parsed as Latin-1 → UTF-8 is not, and I assume that is your case. If so, then on Linux you can reverse it like this:

iconv -f utf8 -t iso-8859-1 < bad.file.latin1 > good.file.utf8

If the intermediate conversion was something lossy like cp1252, then the process is more involved and will require something like what is detailed at:

http://www.pixelbeat.org/docs/unicode_utils/
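For reference, the same reversal the iconv command performs can be sketched in Python (filenames follow the example above; for a self-contained demo, the snippet writes its own double-encoded input file first):

```python
original = "naïve café"

# Create a "bad" file: UTF-8 text that was misread as Latin-1 and
# re-saved as UTF-8 (double-encoded).
with open("bad.file.latin1", "w", encoding="utf-8") as f:
    f.write(original.encode("utf-8").decode("latin-1"))

# Reverse it: read as UTF-8, then re-encode as Latin-1 to recover
# the original UTF-8 byte sequence.
with open("bad.file.latin1", encoding="utf-8") as f:
    recovered_bytes = f.read().encode("latin-1")

with open("good.file.utf8", "wb") as f:
    f.write(recovered_bytes)

print(recovered_bytes.decode("utf-8"))  # -> naïve café
```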

chillily
Answer 3 · 2019-09-04 09:27

Sorry to say, but this does not make a whole lot of sense. :)

Scenario 1: A string like "Héllö wörld", which contains characters valid in both UTF-8 and Latin1, was properly converted from UTF-8 to Latin1: no problem. You just need to interpret it in Latin1 now.

Scenario 2: A string like "Hello 世界", which contains characters valid in UTF-8 but not in Latin1, was properly converted from UTF-8 to Latin1: in this case, the characters which are not representable in Latin1 likely have been replaced by ?, i.e. the string is now "Hello ??" and there's nothing you can do about it.

Scenario 3: A string like "Héllö 世界", which contains any sort of characters and was saved as UTF-8, was converted from assumed Latin1 to UTF-8. That means the characters have been misinterpreted but are now properly encoded UTF-8: "Héllö ä¸ç". In this case, you can reverse the encoding UTF-8 → Latin1 and interpret the result as UTF-8 to get the original back.
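Scenario 3 and its reversal can be reproduced in a few lines of Python (a sketch, assuming Latin-1 was the intermediate encoding):

```python
original = "Héllö 世界"

# The corruption step: the UTF-8 bytes are misread as Latin-1 and
# kept as a (now garbled) Unicode string.
mojibake = original.encode("utf-8").decode("latin-1")

# The reversal: encoding as Latin-1 restores the original byte
# sequence, which is valid UTF-8 again.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # -> Héllö 世界
```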

Scenario 4: A string like "Héllö Wörld", which contains Latin1 characters and was saved as Latin1, was misinterpreted as UTF-8, then saved as UTF-8, in which case it's now "H�ll� W�rld". This string is now irrecoverable.

There are many more possible combinations of what happened; it's impossible to tell you exactly what can or can't be done without more information. First of all, make sure you are interpreting the string correctly now and it's not simply a display issue.

The fact that you're seeing "�" in there suggests you are trying to interpret something as UTF-8 whose bytes the UTF-8 decoder cannot make sense of, so it replaces them with "�". Either that misinterpretation is happening now and the underlying data is fine, or it's scenario 4.
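One defensive way to distinguish these cases is to attempt the reversal and keep the string unchanged when it fails (a sketch; `try_fix` is an illustrative name, and note that a pure-Latin-1 string can in rare cases happen to be valid UTF-8, so this heuristic is not foolproof):

```python
def try_fix(s: str) -> str:
    """Attempt the Latin-1 round-trip.  If the string contains
    characters outside Latin-1, or its byte values are not valid
    UTF-8, it was probably not double-encoded: return it as-is."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(try_fix("HÃ©llo"))      # double-encoded -> Héllo
print(try_fix("Hello 世界"))  # already fine -> returned unchanged
```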
