Chinese Unicode issue?

Posted 2019-08-03 15:55

From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31

<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>

I'm scraping the text and trying to get 百度汇总

but when I set r.encoding = 'utf-8' the result is �ٶȻ���,

and if I don't use utf-8 the result is °Ù¶È»ã×Ü.
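A minimal sketch of what I'm running (assuming the requests library; the URL is the one above):

import requests

url = 'http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31'
r = requests.get(url)
r.encoding = 'utf-8'           # decoding as UTF-8 gives the replacement characters
print(u'百度汇总' in r.text)   # False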

1 Answer
Answered 2019-08-03 16:41

The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

GB2312 is a variable-width encoding, like UTF-8. The page lies, however; it actually uses GBK, an extension of GB2312.

You can decode it with GBK just fine:

>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True

Decoding with gb2312 fails:

>>> r.content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence

but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
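With requests that is a one-line override; a minimal sketch, assuming r is the response from above:

>>> r.encoding = 'gbk'
>>> u'百度汇总' in r.text
True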

If you are using requests, even setting r.encoding to gb2312 appears to work, because r.text uses errors='replace' when decoding:

content = str(self.content, encoding, errors='replace')

so the decoding error when using GB2312 is masked; only the codepoints defined exclusively in GBK come out as replacement characters.
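You can reproduce that masking manually; a sketch that decodes with errors='replace' (again assuming r from above):

>>> masked = r.content.decode('gb2312', errors='replace')
>>> u'百度汇总' in masked
True
>>> u'\ufffd' in masked
True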

Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:

>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The warning is caused by GBK-only codepoints appearing in a page that claims to use GB2312.
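To avoid the warning you can pass the encoding explicitly; a sketch using BeautifulSoup's from_encoding argument (the id e_9 comes from the snippet in the question):

>>> soup = BeautifulSoup(r.content, from_encoding='gbk')
>>> soup.find(id='e_9').get_text()
u'百度汇总'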
