UnicodeDecodeError, invalid continuation byte

2019-01-01 00:16发布

Why is the below item failing? and why does it succeed with "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

results in:

 Traceback (most recent call last):  
 File "<stdin>", line 1, in <module>  
 File "C:\Python27\lib\encodings\utf_8.py",
 line 16, in decode
     return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

5条回答
闭嘴吧你
2楼-- · 2019-01-01 00:54

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode

查看更多
倾城一夜雪
3楼-- · 2019-01-01 00:57

I had the same error when I tried to open a csv file by pandas read_csv method.

The solution was change the encoding to 'latin-1':

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')
查看更多
与君花间醉酒
4楼-- · 2019-01-01 00:57

Because UTF-8 is multibyte and there is no char corresponding to your combination of \xe9 plus following space.

Why should it succeed in both utf-8 and latin-1?

Here how the same sentence should be in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
查看更多
后来的你喜欢了谁
5楼-- · 2019-01-01 00:58

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you'd just reject ones that didn't decode.

If you can't do that, you'll need heuristics.

查看更多
孤独寂梦人
6楼-- · 2019-01-01 01:00

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

查看更多
登录 后发表回答