I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of functions, but not this one.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'
So you have to get the header, and split it. Twice.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
... charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'
That's a surprising amount of steps for such a basic function. Am I missing something?
I would go with chardet, the Universal Encoding Detector.
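A minimal sketch, assuming the third-party chardet package is installed (pip install chardet):

import urllib2
import chardet

# chardet guesses the encoding from the bytes themselves, so it works
# even when the Content-Type header doesn't declare a charset
raw = urllib2.urlopen('http://www.google.com/').read()
print(chardet.detect(raw))  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}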
You are doing it right, but your approach would fail for pages where the charset is declared in a meta tag or is not declared at all. If you look closer at the chardet sources, it has charsetprober/charsetgroupprober modules that deal with this problem nicely. Have you checked this?
How to download any(!) webpage with correct charset in python?
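A sketch of combining the two approaches, assuming chardet is installed: prefer the charset declared in the headers, and fall back to byte-level detection when it is missing:

import urllib2
import chardet

response = urllib2.urlopen('http://www.google.com/')
body = response.read()

# try the charset declared in the headers first...
contenttype = response.headers.getheader('Content-Type', '')
charset = None
if 'charset=' in contenttype:
    charset = contenttype.split('charset=')[-1].split(';')[0].strip()

# ...and fall back to detection for meta-tag-only or undeclared pages
if not charset:
    charset = chardet.detect(body)['encoding']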
I did some research and came up with this solution:
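A minimal sketch of that kind of solution, assuming Python 3's urllib.request (the URL is just the question's example):

import urllib.request

response = urllib.request.urlopen('http://www.google.com/')
# response.headers parses the Content-Type header for us
charset = response.headers.get_content_charset()
print(charset)  # e.g. 'ISO-8859-1', or None if no charset was declared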
This is how I would do it in Python 3. I haven't tested it in Python 2, but I am guessing that you would have to use urllib2 instead of urllib.request.

Here is how it works, since the official Python documentation doesn't explain it very well: the result of urlopen is an http.client.HTTPResponse object. The headers property of this object is an http.client.HTTPMessage object which, according to the documentation, "is implemented using the email.message.Message class", and which therefore has a method called get_content_charset that tries to determine and return the character set of the response. By default, this method returns None if it is unable to determine the character set, but you can override this behavior by passing a failobj parameter:
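For example (a sketch; the ISO-8859-1 fallback just mirrors the question's default):

import urllib.request

response = urllib.request.urlopen('http://www.google.com/')
# failobj is returned when no charset can be determined from the headers
charset = response.headers.get_content_charset(failobj='ISO-8859-1')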
You're not missing anything. It's doing the right thing - the encoding of an HTTP response is a subpart of Content-Type.
Note also that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - that's an ugly hack though (on the part of the page author) and is not too common.
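If you do need to handle such pages, a rough sketch (assuming the meta tag appears in the first 2048 bytes; a real HTML parser would be more robust than a regex):

import re
import urllib2

response = urllib2.urlopen('http://www.google.com/')
head = response.read(2048)  # charset meta tags conventionally appear near the top
match = re.search(r'charset=["\']?([\w-]+)', head)
charset = match.group(1) if match else 'ISO-8859-1'  # fall back to the HTTP default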