This question already has an answer here: How to determine the encoding of text?
I'm scraping news articles from various sites, using GAE and Python.
The code, which fetches one article URL at a time, raises the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
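(As a note: 0xe2 is the lead byte of common UTF-8 punctuation such as curly quotes, so decoding such bytes as ASCII reproduces this error exactly:)

    >>> '\xe2\x80\x99'.decode('ascii')  # the UTF-8 bytes for a right single quote, U+2019
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)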
Here's my code in its simplest form:
from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    # headers must be passed by keyword; passed positionally it would land in `payload`
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content
Here is another variant I have tried, with the same result:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')  # fails whenever the page is not actually UTF-8
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
Here's the ugly, brittle one, which also doesn't work:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        try:
            # iso-8859-1 maps every possible byte, so this decode always
            # succeeds and s becomes unicode; every later s.decode(...) then
            # implicitly tries to re-encode s as ASCII first and fails,
            # which is why the final except always fires
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except:
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
The last variant returns s as the string "did not work" from the last except.
So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?
Why have I decided to store the entire HTML rather than parse it with BeautifulSoup right away? Because I want to do the soupifying later, to avoid DeadlineExceededError in GAE.
Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.
It is better to simply read the Content-Type from the meta tags or the HTTP headers. Note that Chrome (unlike Opera) does not guess the encoding: if neither place says it is UTF-8 or anything else, it treats the site as having the Windows default encoding. So only really badly built sites fail to declare it.
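A minimal sketch of that header-then-meta lookup (the helper name and the regexes are my own illustration, not from this answer; result is a GAE urlfetch response):

    import re

    def charset_from_response(result):
        # 1) HTTP header, e.g. "text/html; charset=utf-8"
        #    (header-key case can vary, so check both spellings)
        ctype = result.headers.get('content-type') or result.headers.get('Content-Type', '')
        m = re.search(r'charset=([\w-]+)', ctype, re.I)
        if m:
            return m.group(1)
        # 2) HTML meta tag, e.g. <meta charset="utf-8"> or the older
        #    <meta http-equiv="Content-Type" content="...; charset=utf-8">
        m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', result.content, re.I)
        if m:
            return m.group(1)
        # 3) Fall back to the "Windows default" the answer mentions
        return 'windows-1252'

    s = result.content.decode(charset_from_response(result), 'replace')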
I had the same problem some time ago, and nothing is 100% accurate. What I did was:
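(The concrete steps are cut off here, but a fallback chain of that kind typically looks roughly like the sketch below. This is my illustration, not the answerer's actual code; chardet is a third-party detection library.)

    import chardet  # third-party: pip install chardet

    def to_unicode(raw, declared=None):
        # 1) Trust an explicitly declared charset if there is one...
        if declared:
            try:
                return raw.decode(declared)
            except (UnicodeDecodeError, LookupError):
                pass  # ...but fall through if the declaration lies
        # 2) Otherwise let chardet guess from the raw bytes
        guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
        # 3) Decode with errors='replace' so one bad byte can't abort the scrape
        return raw.decode(guess['encoding'] or 'utf-8', 'replace')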