Chinese character encoding error with BeautifulSoup

Posted 2019-07-13 15:52

Question:

I'd like to use BeautifulSoup to get the data in a table from a website, but it doesn't read the Chinese characters correctly. This is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.515fa.com/che_1978.html").read()
soup=BeautifulSoup(html,from_encoding="UTF-8")
print soup.prettify()

And the Chinese characters are displayed like this:

<td align="center" bgcolor="#FFFFFF" u1:str="" width="173">
               ćé¸</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="149">
               ä¸ćľˇĺ¤§äź</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="126">
               大äź</td>

I really don't know what "ä¸ćľˇĺ¤§äź" is. I tried changing the encoding from "utf-8" to "gb18030", but that didn't work either. How can I get the correct Chinese characters? Thanks!

Answer 1:

Try:

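import urllib2
from bs4 import BeautifulSoup
# Decode the raw bytes yourself, dropping any bytes that are not valid UTF-8
html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)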

I'm not sure exactly what BeautifulSoup(from_encoding=) did, but decoding the bytes myself did the trick.
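
For anyone wondering where text like "ä¸ćľˇĺ¤§äź" comes from, here is a small offline sketch. It is written for Python 3 (the code in this thread is Python 2) and makes several assumptions: the page really is UTF-8, the sample string and the stray 0xFF byte are made up for illustration, and ISO-8859-2 is picked only because it reproduces characters seen in the question. One plausible explanation, not confirmed anywhere in this thread, is that the page contains a few bytes that are not valid UTF-8, so a strict decode fails and the parser falls back to guessing another encoding.

```python
# Stand-in for the bytes returned by urlopen(...).read(); assumed to be UTF-8.
raw = "<td>上海大众</td>".encode("utf-8")

# Decoding with the codec the bytes were written in recovers the text.
good = raw.decode("utf-8")

# Decoding the same bytes with a single-byte codec garbles them.
# ISO-8859-2 is an assumption, chosen because it yields the "ćľˇ" seen above.
garbled = raw.decode("iso-8859-2")

# Hypothetical damage: one stray byte that is not valid UTF-8,
# inserted just before the closing tag.
damaged = raw[:-5] + b"\xff" + raw[-5:]

# A strict decode of the damaged bytes raises an error...
try:
    damaged.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while errors="ignore" drops the bad byte and keeps everything else.
cleaned = damaged.decode("utf-8", "ignore")

print(good)
print(repr(garbled))
print(strict_ok, cleaned == good)
```

The decoded string can then be handed to BeautifulSoup as in the answer above. Since it is already text, the parser has no encoding left to guess. Keep in mind that 'ignore' discards bad bytes silently, so a page that is actually in a different encoding would lose characters rather than raise an error.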