How to convert a string from CP-1251 to UTF-8?

Posted 2019-02-03 01:37

I'm using mutagen to convert ID3 tag data from CP-1251/CP-1252 to UTF-8. On Linux there is no problem, but on Windows, calling SetValue() on a wx.TextCtrl produces the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The original string (assumed to be CP-1251 encoded) that I'm pulling from mutagen is:

u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'

I've tried converting this to UTF-8:

dd = d.decode('utf-8')

...and even changing the default encoding from ASCII to UTF-8:

sys.setdefaultencoding('utf-8')

...But I get the same error.

6 answers
Anthone
#2 · 2019-02-03 01:47

I lost half a day finding the correct answer. If you get a Unicode string from an external source that is windows-1251 encoded (a web site, in my case), you will see something like this in the Linux console:

u'\u043a\u043e\u043c\u043d\u0430\u0442\u043d\u0430\u044f \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430.....'

This is not a correct Unicode presentation of your data. So Tim Pietzcker is right: you should encode() it first, then decode() it, and then encode() it again into the correct encoding.

In my case this strange string was stored in the variable text, and the line:

print text.encode("cp1251").decode('cp1251').encode('utf8')   

gave me:

"Своя 2-х комнатная квартира с отличным ремонтом...."

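Yes, it drove me crazy too, but it works! (Strictly speaking, the encode("cp1251").decode('cp1251') round trip hands back the same Unicode string, so the step doing the work is the final encode('utf8').)

P.S. When saving to a file, do it the same way: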

some_file.write(text.encode("cp1251").decode('cp1251').encode('utf8'))
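As an alternative to encoding by hand before every write, here is a minimal sketch that lets the file object do the encoding. It assumes io.open from the standard library (available in Python 2.6+ and Python 3); the sample text and the temporary output path are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample text; any Unicode string is handled the same way.
text = u'\u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430'

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# The file object encodes on write, so no manual encode() call is needed.
with io.open(path, 'w', encoding='utf-8') as some_file:
    some_file.write(text)

# Reading the raw bytes back shows what actually landed on disk.
with io.open(path, 'rb') as raw_file:
    raw = raw_file.read()
```

Opening the file with an explicit encoding keeps the program working with Unicode strings everywhere and moves the conversion to bytes to the boundary where the data leaves the process.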
#3 · 2019-02-03 01:55

Your string d is a Unicode string, not a UTF-8-encoded string! So you can't decode() it, you must encode() it to UTF-8 or whatever encoding you need.

>>> d = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> d
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> print d
Áåëàÿ ÿáëûíÿ ãðîìó
>>> d.encode("utf-8")
'\xc3\x81\xc3\xa5\xc3\xab\xc3\xa0\xc3\xbf \xc3\xbf\xc3\xa1\xc3\xab\xc3\xbb\xc3\xad\xc3\xbf \xc3\xa3\xc3\xb0\xc3\xae\xc3\xac\xc3\xb3'

(which is something you'd do at the very end of all processing when you need to save it as a UTF-8 encoded file, for example).

If your input is in a different encoding, it's the other way around:

>>> d = "Schoßhündchen"                 # native encoding: cp850
>>> d = "Schoßhündchen".decode("cp850") # decode from Windows codepage
>>> d                                   # into a Unicode string (now work with this!)
u'Scho\xdfh\xfcndchen'
>>> print d                             # it displays correctly if your shell knows the glyphs
Schoßhündchen
>>> d.encode("utf-8")                   # before output, convert to UTF-8
'Scho\xc3\x9fh\xc3\xbcndchen'
姐就是有狂的资本
#4 · 2019-02-03 01:55

I'd rather add a comment to Александр Степаненко's answer, but my reputation doesn't yet allow it. I had a similar problem converting MP3 tags from CP-1251 to UTF-8, and the encode/decode/encode solution worked for me, except that I had to replace the first encoding with "latin-1", which maps each code point from U+0000 to U+00FF to the single byte of the same value and so turns the Unicode string back into the original byte sequence:

print text.encode("latin-1").decode('cp1251').encode('utf8')

When saving back, for example with mutagen, the value doesn't need to be encoded:

audio["title"] = title.encode("latin-1").decode('cp1251')
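To make the latin-1 step concrete, here is a small sketch applied to the string from the question. It is written so that it also runs on Python 3, where encode() returns a bytes object; the variable names are only illustrative.

```python
# The string from the question: CP-1251 bytes that were mistakenly
# turned into Unicode code points one byte at a time.
mojibake = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'

# latin-1 maps code points 0..255 straight back to bytes 0..255,
# which recovers the original byte sequence.
raw_bytes = mojibake.encode('latin-1')

# Now interpret those bytes with the encoding they were really written in.
title = raw_bytes.decode('cp1251')

# Encode to UTF-8 only where bytes are required (files, sockets, ...).
utf8_bytes = title.encode('utf-8')
```

Here title is the Unicode string "Белая яблыня грому", which is the value to hand back to a tag library that accepts Unicode text.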
姐就是有狂的资本
#5 · 2019-02-03 01:58

I provided some relevant info on encoding/decoding text in this response: https://stackoverflow.com/a/34662963/2957811

To add to that here, it's important to think of text as being in one of two possible states: 'encoded' and 'decoded'.

'decoded' means it is in an internal representation by your interpreter/libraries that can be used for character manipulation (e.g. searches, case conversion, substring slicing, character counts, ...) or display (looking up a code point in a font and drawing the glyph), but cannot be passed in or out of the running process.

'encoded' means it is a byte stream that can be passed around as can any other data, but is not useful for manipulation or display.

If you've worked with serialized objects before, consider 'decoded' to be the useful object in memory and 'encoded' to be the serialized version.

'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3' is your encoded (or serialized) version, presumably encoded with cp1251. This encoding needs to be right because that's the 'language' used to serialize the characters and is needed to recreate the characters in memory.

You need to decode this from its current encoding (cp1251) into Python Unicode characters, then re-encode it as a UTF-8 byte stream. The answerer who suggested d.decode('cp1251').encode('utf8') had this right; I am just hoping to help explain why that should work.
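A short sketch of the two states, assuming the data really is cp1251 and starting from a byte string (the b prefix keeps the example valid on both Python 2.6+ and Python 3); the variable names are illustrative only.

```python
# 'Encoded' state: a byte stream, here assumed to be cp1251.
encoded_cp1251 = b'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'

# 'Decoded' state: characters that can be searched, sliced and counted.
decoded = encoded_cp1251.decode('cp1251')
first_word = decoded.split()[0]

# Back to the 'encoded' state, this time serialized as UTF-8.
encoded_utf8 = decoded.encode('utf-8')

# Same characters, different serializations: cp1251 uses one byte per
# Cyrillic letter, UTF-8 uses two.
sizes = (len(encoded_cp1251), len(decoded), len(encoded_utf8))
```

The character count stays the same across the round trip; only the byte length changes, which is what the serialization analogy predicts.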

Ridiculous、
#6 · 2019-02-03 02:02

If you know for sure that you have cp1251 in your input, you can do

d.decode('cp1251').encode('utf8')
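If you are not sure about the input, one possible sketch is to try a short list of candidate encodings in order. The helper name to_utf8 and the candidate order are assumptions made for illustration, not part of any library, and the approach is a heuristic: it tries UTF-8 first because cp1251 text rarely happens to be valid UTF-8.

```python
def to_utf8(raw, candidates=('utf-8', 'cp1251')):
    """Return raw re-encoded as UTF-8, using the first codec that decodes it.

    Hypothetical helper: the order of candidates matters, because a
    single-byte codec such as cp1251 accepts almost any byte sequence.
    """
    for name in candidates:
        try:
            return raw.decode(name).encode('utf-8')
        except UnicodeDecodeError:
            continue  # this codec does not fit, try the next one
    raise ValueError('none of the candidate encodings fit the input')


# cp1251 input is converted; input that is already UTF-8 passes through.
converted = to_utf8(b'\xc1\xe5\xeb\xe0\xff')
unchanged = to_utf8(b'\xd0\x91')
```

A guess like this cannot tell cp1251 from cp1252 or other single-byte code pages, so it is only a fallback for when the real encoding is unknown.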
手持菜刀,她持情操
#7 · 2019-02-03 02:10

If d is a correct Unicode string, then d.encode('utf-8') yields a UTF-8-encoded bytestring. Don't test it by printing, though: it may simply fail to display properly because of code page shenanigans.
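One way to check the result without relying on the console is to look at the escaped representation and do a round trip, as in this sketch (the sample string is made up; repr() output carries a b prefix on Python 3 and none on Python 2):

```python
# Hypothetical sample: a short, correct Unicode string.
d = u'\u0411\u0435\u043b\u0430\u044f'

encoded = d.encode('utf-8')

# repr() escapes every non-ASCII byte, so the result is plain ASCII and
# looks the same whatever code page the console uses.
escaped = repr(encoded)

# Decoding the bytes again must give back exactly the original string.
round_trip_ok = encoded.decode('utf-8') == d
```

Comparing values and escaped representations tests the data itself, whereas print only tests whether the terminal can render it.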
