I am trying to get the character encoding for pages that I scrape, but in some cases it is failing. Here is what I am doing:
resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, request)
content = resp.read()
encodeType = resp.headers.getparam('charset')
resp.close()
That is my first attempt. But if charset comes back as None, I do this:
soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
    except AttributeError, e:
        print e
        try:
            encodeType = soup.findAll('meta', {'charset':lambda v:v.lower() != None})
        except AttributeError, e:
            print e
    if encodeType == '':
        encodeType = 'iso-8859-1'
The page I am testing has this in the header:
<meta charset="ISO-8859-1">
I would expect the first try statement to return an empty string, but I get this error on both try statements (which is why the 2nd statement is nested for now):
'NoneType' object has no attribute 'lower'
What is wrong with the 2nd try statement? I am guessing the 1st one is incorrect as well, since it's throwing an error and not just coming back blank.
Or, better yet, is there a more elegant way to just remove any special character encoding from a page? In the end I don't care about any of the specially encoded characters: I want to delete them and keep the raw text. Can I skip all of the above and tell BeautifulSoup to just strip anything that is encoded?
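On the error itself: the message suggests that BeautifulSoup calls the matching function once for every meta tag and passes None when the tag has no such attribute, so v.lower() fails on any tag without http-equiv (or without charset). Note also that findAll returns a list of tags, never '', so the final == '' check cannot fire. A minimal sketch of None-safe matchers follows; the function names are made up for illustration and are not part of BeautifulSoup.

```python
def is_content_type(value):
    # value is None when the tag has no http-equiv attribute,
    # so check for that before calling .lower().
    return value is not None and value.lower() == 'content-type'


def has_charset(value):
    # True only for tags that actually carry a charset attribute.
    return value is not None


# Intended use with the soup object from the question (hypothetical wiring):
#   soup.findAll('meta', {'http-equiv': is_content_type})
#   soup.findAll('meta', {'charset': has_charset})
```

Either call then returns a (possibly empty) list of matching tags, and the charset still has to be read from the tag's charset or content attribute.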
When attempting to determine the character encoding of a page, I believe the order that should be tried is:
1. The meta tag in the HTML page itself (e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">)
2. The HTTP response header (e.g. Content-Type: text/html; charset=ISO-8859-1)
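The order above could be sketched roughly as follows. This is only an illustration under some assumptions: detect_charset and the two regular expressions are hypothetical names, a regex scan stands in for a real HTML parser, only the first 4096 bytes are searched, and iso-8859-1 is used as the last-resort default because the question does the same.

```python
import re

# Matches charset=... inside a meta tag, covering both
# <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)

# Matches charset=... inside a Content-Type header value.
HEADER_CHARSET_RE = re.compile(
    r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def detect_charset(raw_html, content_type_header=None, default='iso-8859-1'):
    """Return a charset name: meta tag first, then HTTP header, then default."""
    # Assumption: meta tags sit near the top, so scan only the first 4096 bytes.
    match = META_CHARSET_RE.search(raw_html[:4096])
    if match:
        return match.group(1).decode('ascii').lower()
    if content_type_header:
        match = HEADER_CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    return default
```

Here raw_html is the undecoded page body (bytes) and content_type_header is the raw Content-Type header value. Note that browsers following the HTML standard consult the HTTP header before the meta tag, so the two checks can be swapped if that behaviour is wanted.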
I decided to just go with whatever BeautifulSoup spits out. Then, as I parse through each word in the document, if I can't convert it to a string, I just disregard it.
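A related way to throw away everything that is not plain text, sketched here without BeautifulSoup: decode the bytes leniently, then re-encode to ASCII and ignore whatever does not fit. This drops individual characters rather than whole words, so it is a variation on the approach above, not the same thing. The helper name to_plain_ascii and its default charset are assumptions made for the example.

```python
def to_plain_ascii(raw_bytes, charset='iso-8859-1'):
    """Decode raw_bytes and drop every character outside ASCII."""
    try:
        # 'replace' turns undecodable bytes into U+FFFD instead of raising.
        text = raw_bytes.decode(charset, 'replace')
    except LookupError:
        # Unknown charset name: fall back to iso-8859-1, which maps every byte.
        text = raw_bytes.decode('iso-8859-1')
    # 'ignore' silently discards characters that ASCII cannot represent.
    return text.encode('ascii', 'ignore').decode('ascii')
```

The result contains only ASCII characters, so converting it to a plain string later should not raise an encoding error.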