python urllib2 utf-8 encoding

Posted 2019-04-02 11:07

I have # -*- coding: utf-8 -*- at the top of my Python file.

The snippet:

import urllib2

opener = urllib2.build_opener()
# Set both headers in one list; a second assignment would overwrite the first.
opener.addheaders = [('User-agent', 'Mozilla/5.0'),
                     ('Accept-Charset', 'utf-8')]
f = opener.open(url)
doc = f.read().decode('utf-8')

The server response headers (via f.info()) include:

Content-Type: text/html; charset=UTF-8

But I get the error:

UnicodeDecodeError: 'utf8' codec can't decode byte[...]: invalid continuation byte

What's wrong here?

Tags: python encoding utf-8 urllib2

2 Answers

Ridiculous、

#2 · 2019-04-02 11:55

Try decoding the data as 'latin-1' to see what it looks like. Latin-1 maps every byte to a character, so that decode cannot fail. The error you're seeing means the bytes are not valid UTF-8 (see "UnicodeDecodeError, invalid continuation byte").

It would be helpful if you posted the result of list(f.read())[:100] so we can see the data.
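As a rough sketch of that inspection step, something like the following could locate the first offending byte and show the bytes around it. The helper name describe_decode_failure and the sample bytes are made up for illustration; in the real script, raw would be the result of f.read().

```python
def describe_decode_failure(raw, encoding='utf-8', context=20):
    """Return None if raw decodes cleanly, else details about the first bad byte."""
    try:
        raw.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        hi = exc.end + context
        return {
            'offset': exc.start,
            'reason': exc.reason,
            # latin-1 maps every byte to a code point, so this decode never fails
            'context_latin1': raw[lo:hi].decode('latin-1'),
            'context_bytes': raw[lo:hi],
        }


# Hypothetical stand-in for f.read(): text encoded as Latin-1, not UTF-8.
sample = b'caf\xe9 au lait'
info = describe_decode_failure(sample)
print('offset %d: %s' % (info['offset'], info['reason']))
print(repr(info['context_bytes']))
```

If the Latin-1 view of the context reads like normal text, the body is probably in a single-byte encoding rather than UTF-8.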

FYI, the # -*- coding: utf-8 -*- line is unrelated to your issue. It declares the encoding of your Python script itself, not of the data the script handles :-)

萌系小妹纸

#3 · 2019-04-02 12:03

That particular error is commonly caused by decoding as UTF-8 when the data was actually encoded as Latin-1. See "UnicodeDecodeError, invalid continuation byte" for more info.

I suspect that, despite the header, the server is not returning UTF-8 encoded content.
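One way to test that suspicion is to compare the charset the header declares with what the body actually decodes as. This is only a sketch: the function names are hypothetical, the header string is the one from the question, and the sample bytes stand in for f.read().

```python
from email.message import Message


def declared_charset(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent
    return msg.get_content_charset() or default


def matches_declared_charset(content_type, raw):
    """Return (charset, ok); ok is True only if raw decodes strictly as charset."""
    charset = declared_charset(content_type)
    try:
        raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return charset, False
    return charset, True


# Hypothetical body bytes that are Latin-1 despite the UTF-8 header.
header = 'text/html; charset=UTF-8'
charset, ok = matches_declared_charset(header, b'caf\xe9 au lait')
print('declared %s, body decodes cleanly: %s' % (charset, ok))
```

A False result here means the header and the body disagree, which is the situation described above.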

One option worth trying is to use chardet to guess which encoding was used. Useful as chardet is, treat it as a last resort.
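A minimal sketch of that last-resort ordering, assuming you trust the declared encoding first and only guess when it fails. decode_with_fallback and chardet_guess are names invented here; chardet is a third-party package and is used only if it happens to be installed.

```python
def decode_with_fallback(raw, declared='utf-8', guess=None, last_resort='latin-1'):
    """Return (text, encoding_used), trying declared, then a guess, then last_resort."""
    try:
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass
    # Only ask the guesser once the declared encoding has failed.
    guessed = guess(raw) if guess is not None else None
    if guessed:
        try:
            return raw.decode(guessed), guessed
        except (UnicodeDecodeError, LookupError):
            pass
    # latin-1 accepts any byte sequence, so this cannot raise.
    return raw.decode(last_resort), last_resort


def chardet_guess(raw):
    """Guess an encoding with chardet if it is installed, else return None."""
    try:
        import chardet
    except ImportError:
        return None
    return chardet.detect(raw).get('encoding')


# Hypothetical stand-in for f.read(); no guesser passed, so latin-1 is the fallback.
text, used = decode_with_fallback(b'caf\xe9 au lait')
print('decoded as %s' % used)
```

Passing guess=chardet_guess enables the detection step. A guess is still a guess, so the result can be wrong for short or ambiguous input.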
