UnicodeDecodeError, invalid continuation byte

Why is the below item failing? and why does it succeed with "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

results in:

 Traceback (most recent call last):  
 File "<stdin>", line 1, in <module>  
 File "C:\Python27\lib\encodings\utf_8.py",
 line 16, in decode
     return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

标签： python unicode decode

5条回答

闭嘴吧你

2楼-- · 2019-01-01 00:54

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode

0人赞添加讨论(0) 举报

倾城一夜雪

3楼-- · 2019-01-01 00:57

I had the same error when I tried to open a csv file by pandas read_csv method.

The solution was change the encoding to 'latin-1':

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')

0人赞添加讨论(0) 举报

与君花间醉酒

4楼-- · 2019-01-01 00:57

Because UTF-8 is multibyte and there is no char corresponding to your combination of \xe9 plus following space.

Why should it succeed in both utf-8 and latin-1?

Here how the same sentence should be in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

0人赞添加讨论(0) 举报

后来的你喜欢了谁

5楼-- · 2019-01-01 00:58

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you'd just reject ones that didn't decode.

If you can't do that, you'll need heuristics.

0人赞添加讨论(0) 举报

孤独寂梦人

6楼-- · 2019-01-01 01:00

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

0人赞添加讨论(0) 举报

UnicodeDecodeError, invalid continuation byte

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间