I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encodings, such as utf-8, gb2312, and gbk. For example, I use urllib2 to get sohu's home page, which is encoded in gbk, but my code decodes the page like this:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know which encoding a page uses before I use BeautifulSoup to decode it to unicode? On most Chinese websites, there is no Content-Type field in the HTTP headers.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute, which holds the encoding BeautifulSoup detected. This agrees with the encoding declared in the <meta> tag in the HTML's <head>. Now you can decode the HTML with that encoding, but there is not much point, since the parsed HTML is already available as unicode.
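A minimal sketch of the steps above, assuming the bs4 package and Python's built-in html.parser. The in-memory html_bytes page is a made-up stand-in for whatever urllib2 returns, so the example runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption, not sohu's real markup),
# encoded as gbk to mimic the raw bytes of a downloaded page.
html_bytes = (
    u'<html><head><meta charset="gbk"><title>\u641c\u72d0</title></head>'
    u'<body><p>\u4f60\u597d</p></body></html>'
).encode('gbk')

# Pass the raw bytes, not a decoded string, so BeautifulSoup does the detection.
soup = BeautifulSoup(html_bytes, 'html.parser')

# Encoding BeautifulSoup settled on while converting the bytes to unicode.
print(soup.original_encoding)

# Encoding declared in the <meta> tag of the <head>.
print(soup.find('meta').get('charset'))

# Manual decoding with the detected encoding is possible...
decoded = html_bytes.decode(soup.original_encoding)

# ...but redundant: the parsed tree already holds unicode strings.
print(soup.title.string)
```

If the page declares no encoding at all, BeautifulSoup falls back to guessing, so original_encoding is best treated as a hint rather than a guarantee.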
It is also possible to attempt to detect the encoding using the chardet module (although it is a bit slow).
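A rough sketch of the chardet approach, assuming the chardet package is installed. The helper name detect_and_decode and the sample bytes are made up for illustration. chardet.detect returns a statistical guess with a confidence score, so the result can be wrong, especially on short inputs:

```python
import chardet


def detect_and_decode(raw, fallback='utf-8'):
    """Hypothetical helper: guess the encoding of raw bytes and decode them."""
    guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    # chardet reports None when it cannot make a guess at all.
    encoding = guess['encoding'] or fallback
    # 'replace' keeps decoding from raising if the guess is slightly off.
    return raw.decode(encoding, 'replace'), guess


# Made-up sample bytes standing in for a downloaded page body.
raw = (u'\u4f60\u597d\uff0c\u4e16\u754c\u3002' * 20).encode('gbk')

text, guess = detect_and_decode(raw)
print(guess['encoding'], guess['confidence'])
```

Checking guess['confidence'] before trusting the result is a reasonable precaution. When the page also declares an encoding in its <meta> tag, that declaration is usually the better source.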