I have a unicode string like this:
\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7
I know it is the string representation of bytes that were encoded with UTF-8.
Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 itself is <type 'unicode'>.
How do I decode it to the real string 山东 日照?
If you printed the repr() output of your unicode string, then you appear to have Mojibake: bytes data decoded using the wrong encoding.
First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1 and then decoding the result as UTF-8.
This depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that is not actually encodable back to CP1252 if UTF-8 bytes outside the CP1252 range were force-decoded anyway.
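To make the CP1252 caveat concrete, here is a small sketch. The variable names are illustrative, and the "forced" string assumes the faulty tool passed an unassigned byte through as the code point of the same value, which is only one way such tools behave.

```python
# -*- coding: utf-8 -*-
original = u'\u5c71\u4e1c \u65e5\u7167'  # the text from the question
utf8_bytes = original.encode('utf8')

# Suppose the bytes were wrongly decoded as CP1252 rather than Latin-1.
as_cp1252 = utf8_bytes.decode('cp1252')

# Latin-1 cannot undo this: bytes 0x9C, 0x97 and 0x85 became
# U+0153, U+2014 and U+2026, which Latin-1 has no bytes for.
try:
    as_cp1252.encode('latin1')
    latin1_works = True
except UnicodeEncodeError:
    latin1_works = False

# Reversing with the codec that was actually used does work.
fixed = as_cp1252.encode('cp1252').decode('utf8')

# U+00C1 is the bytes C3 81 in UTF-8, and 0x81 is unassigned in CP1252.
# Assumption: a forced decode kept 0x81 as U+0081.
forced = u'\u00c3\u0081'
try:
    forced.encode('cp1252')
    cp1252_roundtrip_works = True
except UnicodeEncodeError:
    cp1252_roundtrip_works = False
```

Here the Latin-1 repair fails on the CP1252 Mojibake, and the CP1252 repair fails on the force-decoded text, so the right reverse step depends on what went wrong in the first place.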
The best way to repair such mistakes is using the ftfy library, which knows how to deal with forced-decoded Mojibake texts for a variety of codecs.
For your small sample, Latin-1 appears to work just fine.
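Applied to the sample from the question, the Latin-1 round trip looks roughly like this. It is a minimal sketch; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# The mis-decoded text from the question: each character stands for one UTF-8 byte.
mojibake = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'

# Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# so encoding recovers the original bytes exactly.
raw_bytes = mojibake.encode('latin1')

# Decode the recovered bytes with the codec that should have been used.
repaired = raw_bytes.decode('utf8')
# repaired is now u'\u5c71\u4e1c \u65e5\u7167', i.e. 山东 日照
```

With the third-party ftfy package installed, ftfy.fix_text(mojibake) should give the same text; that call is not exercised in this sketch.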
If you have the literal characters \ and x, followed by two hex digits, you have another layer of encoding where the bytes were replaced by 4 characters each. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec.
'string_escape' is a Python 2 only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
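A sketch of undoing that extra layer follows. Because string_escape does not exist in Python 3, the runnable part swaps in the unicode_escape codec plus a Latin-1 step, which should behave the same for pure \xNN input on both Python 2 and 3. That substitution is an assumption of this example, not part of the approach described above, and the variable names are illustrative.

```python
# Each original byte is spelled out as four literal characters: \, x, two hex digits.
escaped = u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'

# Python 2 only, using the codec named above:
#   repaired = escaped.decode('string_escape').decode('utf8')

# Portable alternative (assumption): interpret the escapes, giving one
# character per original byte ...
one_char_per_byte = escaped.encode('ascii').decode('unicode_escape')

# ... then turn those characters back into bytes and decode them as UTF-8.
repaired = one_char_per_byte.encode('latin1').decode('utf8')
```

The intermediate value is the same kind of Mojibake string as before, so the final step is just the Latin-1 round trip again.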