Fixing mojibakes in UTF-8 text

Posted 2020-04-16 05:31

I have a file with Portuguese text that is supposed to be UTF-8. Somehow, whoever produced the file selected the wrong encoding, and the text is full of mojibake:

IDENTIFICAÌàÌÄO instead of identificação
AndrÃ© instead of André

Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file without replacing every incorrect character manually?

Tags: python utf-8 character-encoding mojibake

1 answer

闹够了就滚

#2 · 2020-04-16 05:58

"AndrÃ©" instead of "André" is what you get when the UTF-8 bytes of the text are decoded as Latin-1. You can fix it by reversing that step: encode back to Latin-1, then decode as UTF-8:

>>> 'AndrÃ©'.encode('latin-1').decode('utf-8')
'André'

All cases following this pattern can be fixed like that.
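To apply this to a whole file, a small helper along these lines might be enough. This is only a sketch: the function name `fix_latin1_mojibake` is made up for illustration, and it assumes that any line which does not survive the Latin-1/UTF-8 round trip should be left untouched rather than partially repaired.

```python
def fix_latin1_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1, line by line.

    Lines that cannot be round-tripped (for example, lines that are
    already correct and contain non-ASCII characters) are kept as-is.
    """
    fixed_lines = []
    for line in text.split('\n'):
        try:
            fixed_lines.append(line.encode('latin-1').decode('utf-8'))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not Latin-1 mojibake (or not only that): keep the line unchanged.
            fixed_lines.append(line)
    return '\n'.join(fixed_lines)


print(fix_latin1_mojibake('AndrÃ©'))
```

Working line by line means that one line with a different kind of damage does not block the repair of the rest of the file.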

However, I can't explain the other case (with "Ìà" for "ç" and "ÌÄ" for "ã"), and therefore can't provide a solution for it. In UTF-8, "ç" is the bytes C3 A7 and "ã" is the bytes C3 A3, so if you can find a codec in which "Ì", "à", and "Ä" have the byte values C3, A7, and A3, respectively, you can use that codec instead of Latin-1 to fix the text.
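One way to look for such a codec is to brute-force over the encodings that ship with Python. The following is a rough sketch, not a guaranteed fix: the helper name `codecs_matching` is hypothetical, it only considers codecs listed in `encodings.aliases`, and it may well find nothing for this particular case.

```python
import encodings.aliases


def codecs_matching(char_to_byte):
    """Return the names of codecs that encode each character to the given single byte."""
    matches = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            if all(ch.encode(name) == bytes([b]) for ch, b in char_to_byte.items()):
                matches.append(name)
        except (UnicodeError, LookupError):
            # The codec cannot encode one of the characters, is not a text
            # encoding, or is not available on this platform.
            continue
    return matches


# Candidates for the unexplained case from the question.
print(codecs_matching({'Ì': 0xC3, 'à': 0xA7, 'Ä': 0xA3}))
```

If the list comes back empty, no single-byte codec in the standard library fits, and the damage probably involves more than one wrong decoding step.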
