I have this string that has been decoded from Quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple" which would correspond to "Äpple" (Apple in Swedish). However, I can't convert those strings to UTF-8.
>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What should I do?
Decode to Unicode, encode the results to UTF8.
This is a common problem, so here's a relatively thorough illustration.
For non-unicode strings (i.e. those without a u prefix like u'\xc4pple'), one must decode from the encoding the bytes are actually in (here iso8859-1/latin1; Python 2 itself assumes ASCII for implicit conversions, unless modified with the enigmatic sys.setdefaultencoding function) to unicode, then encode to a character set that can represent the characters you wish. In this case I'd recommend UTF-8.
First, here is a handy utility function that'll help illuminate the patterns of Python 2.7 string and unicode:
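A minimal stand-in for such a helper, assuming all it needs to report is a value's type and repr. The name describe is hypothetical, and the explicit b and u prefixes are used so the literals mean the same thing on Python 2.7 and Python 3:

```python
def describe(value):
    """Return (type name, repr) for a value; a hypothetical inspection helper."""
    return (type(value).__name__, repr(value))


raw = b"\xc4pple"    # byte string: str on Python 2.7, bytes on Python 3
text = u"\xc4pple"   # text string: unicode on Python 2.7, str on Python 3

print(describe(raw))
print(describe(text))
```

The exact repr differs between Python versions, so the useful part is the type name: it shows at a glance whether a value is still bytes or already text.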
A plain string
Decoding an iso8859-1 string - convert plain string to unicode
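A short sketch of this step, assuming the bytes are the ones from the question and really are ISO-8859-1:

```python
raw = b"\xc4pple"                  # five bytes, as in the question
text = raw.decode("iso-8859-1")    # five code points of text

assert text == u"\xc4pple"
# ISO-8859-1 maps every byte to exactly one code point, so the length is unchanged.
assert len(raw) == 5 and len(text) == 5
```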
A little more illustration — with “Ä”
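As a small illustration, the single byte 0xC4 decodes to one code point whose Unicode name can be looked up with the standard unicodedata module (the variable names here are only for illustration):

```python
import unicodedata

a_umlaut = b"\xc4".decode("iso-8859-1")

assert a_umlaut == u"\u00c4"
assert ord(a_umlaut) == 0xC4          # in ISO-8859-1 the byte value equals the code point

name = unicodedata.name(a_umlaut)     # official Unicode character name
print(name)                           # LATIN CAPITAL LETTER A WITH DIAERESIS
```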
Encoding to UTF
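A sketch of the encoding step, assuming the text has already been decoded to unicode:

```python
text = u"\xc4pple"
utf8_bytes = text.encode("utf-8")

assert utf8_bytes == b"\xc3\x84pple"
# Five characters become six bytes: the non-ASCII letter needs two bytes in UTF-8.
assert len(text) == 5 and len(utf8_bytes) == 6
```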
Relationship between unicode and UTF and latin1
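One way to picture the relationship: the same unicode text yields different bytes under each encoding, and each byte string decodes back to the same text. A sketch, with the little-endian UTF-16 variant chosen here only so the bytes are predictable (no byte order mark):

```python
text = u"\xc4pple"

encoded = {
    "latin-1": text.encode("latin-1"),
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
}

assert encoded["latin-1"] == b"\xc4pple"
assert encoded["utf-8"] == b"\xc3\x84pple"
assert encoded["utf-16-le"] == b"\xc4\x00p\x00p\x00l\x00e\x00"

# Every encoded form round-trips to the same text when decoded with its own codec.
for codec, data in encoded.items():
    assert data.decode(codec) == text
```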
Unicode Exceptions
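A sketch of the typical failures, assuming Latin-1 bytes are decoded with the wrong codec, or text is encoded to a codec that lacks the character (the failures list is just bookkeeping for this example):

```python
latin1_bytes = b"\xc4pple"
failures = []

# Decoding with a codec the bytes are not valid in raises UnicodeDecodeError.
for codec in ("ascii", "utf-8"):
    try:
        latin1_bytes.decode(codec)
    except UnicodeDecodeError as exc:
        failures.append(codec)
        print("decode as %s failed: %s" % (codec, exc))

# Encoding text to a codec that cannot represent it raises UnicodeEncodeError.
try:
    u"\u0159".encode("latin-1")
except UnicodeEncodeError as exc:
    failures.append("latin-1")
    print("encode to latin-1 failed: %s" % exc)
```

The first case is the same failure as in the question: on Python 2, calling encode on a byte string triggers an implicit ASCII decode first.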
One would get around these by converting from the specific encoding (latin-1, utf8, utf16) to unicode, e.g. u8.decode('utf8').encode('latin1').
So perhaps one could draw the following principles and generalizations:
- str is a sequence of bytes, which may be in one of a number of encodings such as Latin-1, UTF-8, and UTF-16
- unicode is a sequence of code points that can be encoded to any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1)
- the print command has its own logic for encoding, taken from sys.stdout.encoding, which is commonly UTF-8 on modern terminals
- One must decode a str to unicode before converting to another encoding.
Of course, all of this changes in Python 3.x.
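To give a rough idea of the Python 3 difference, here is a Python 3 only sketch: text is str, encoded data is bytes, and the two are not mixed implicitly in concatenation.

```python
text = "\xc4pple"              # str is always text on Python 3
data = text.encode("utf-8")    # bytes is always raw data

assert isinstance(text, str) and isinstance(data, bytes)
assert not hasattr(text, "decode")   # text cannot be decoded a second time

# Concatenating bytes and str raises TypeError instead of guessing an encoding.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
```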
Hope that is illuminating.
I do this. I am not sure whether it is a good approach, but it works every time!
Try decoding it first, then encoding:
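A minimal sketch of that two-step conversion, assuming the input is a byte string in ISO-8859-1 (the function name transcode and its defaults are hypothetical):

```python
def transcode(data, source="iso-8859-1", target="utf-8"):
    """Decode bytes from the source encoding, then re-encode them in the target."""
    return data.decode(source).encode(target)


apple = b"\xc4pple"            # same bytes as "\xC4pple" in the question
utf8_apple = transcode(apple)

assert utf8_apple == b"\xc3\x84pple"
```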
For Python 3:
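A minimal sketch of the usual repair, assuming the text is a str whose characters are really UTF-8 bytes that were read as ISO-8859-1 (the name fix_mojibake is hypothetical):

```python
def fix_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)


# The UTF-8 bytes of the intended word, misread as ISO-8859-1.
garbled = "Ve\u00c5\u0099ejn\u00c3\u00a9"
fixed = fix_mojibake(garbled)

assert fixed == "Ve\u0159ejn\u00e9"
```

This only works when every character of the garbled text fits in the wrongly assumed encoding; otherwise the encode step raises UnicodeEncodeError.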
I used this for text that had been incorrectly decoded as iso-8859-1 (showing words like VeÅ\x99ejnÃ©) instead of utf-8. This code produces the correct version, Veřejné.