The page at http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31 contains this row:
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the text and trying to get 百度汇总,
but when I set r.encoding = 'utf-8'
the result is �ٶȻ���
If I don't set the encoding to utf-8,
the result is °Ù¶È»ã×Ü
The server doesn't tell you anything helpful in the response headers, but the HTML page itself declares its character set as GB2312 in a `meta` tag.
GB2312 is a variable-width encoding, like UTF-8. The page lies, however; it actually uses GBK, an extension to GB2312.
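As a rough sketch of how that declaration could be picked up and corrected using only the standard library: the helper name `sniff_charset`, the regular expression, and the sample markup below are all assumptions made for illustration, not something taken from the page or from any library.

```python
import re

# Hypothetical pattern: capture the charset named in a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(html_bytes, default="utf-8"):
    """Return the charset declared in the first 2 KB of the document, or a default."""
    match = META_CHARSET.search(html_bytes[:2048])
    if match is None:
        return default
    declared = match.group(1).decode("ascii").lower()
    # GBK is a superset of GB2312, so upgrading the declared codec is safe.
    return "gbk" if declared == "gb2312" else declared

# Sample markup (an assumption; the real page's tag may differ in detail).
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
print(sniff_charset(sample))
```

The returned name can then be passed to `bytes.decode()` or assigned to `r.encoding`.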
You can decode it with GBK just fine.
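For illustration, here is a small sketch that works on the cell text alone. The byte string is reconstructed from the Latin-1 mojibake shown in the question, so treat it as an assumption about what the server sent rather than a captured response.

```python
# Bytes reconstructed from the mojibake in the question (an assumption, not a capture).
raw = b"\xb0\xd9\xb6\xc8\xbb\xe3\xd7\xdc"

# Decoding as Latin-1 reproduces the garbage seen when no encoding is set.
as_latin1 = raw.decode("latin-1")

# Decoding as GBK recovers the intended text.
as_gbk = raw.decode("gbk")

print(as_latin1)
print(as_gbk)
```

With `requests`, the equivalent would be setting `r.encoding = 'gbk'` before reading `r.text`.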
Decoding with `gb2312` fails, but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
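The difference can be sketched without touching the network. The sample string is an assumption chosen for illustration: 龍 is a traditional character that GBK can encode but GB2312 cannot, standing in for whatever GBK-only characters the real page contains.

```python
# Sample text (an assumption): the last character exists in GBK but not in GB2312.
text = "百度龍"
data = text.encode("gbk")

# Strict GB2312 decoding rejects the GBK-only byte sequence.
try:
    data.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# GBK decodes everything, including the characters GB2312 also covers.
round_trip = data.decode("gbk")

print(strict_ok)
print(round_trip == text)
```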
If you are using `requests`, then setting `r.encoding` to `gb2312` works because `r.text` uses `replace` when handling decode errors, so the decoding error when using GB2312 is masked for those codepoints only defined in GBK.
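A sketch of that masking effect, mimicking what `r.text` does by decoding bytes with `errors="replace"`. The sample bytes are an assumption and contain one GBK-only character; no actual request is made here.

```python
# Sample bytes (an assumption): GBK-encoded text ending in a GBK-only character.
data = "百度龍".encode("gbk")

# Lenient decoding, similar in spirit to how requests builds r.text:
# undecodable bytes become U+FFFD instead of raising an exception.
lenient = str(data, "gb2312", errors="replace")

print(lenient)
```

The characters that GB2312 covers come through intact, while the GBK-only one is silently replaced, which is why the problem is easy to miss.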
Note that BeautifulSoup can do the decoding all by itself; it'll find the `meta` header. The warning it prints while parsing is caused by the GBK codepoints being used while the page claims to use GB2312.