How to deal with unknown encoding when scraping web pages

Published 2019-07-24 19:53


I'm scraping news articles from various sites, using GAE and Python.

The code where I scrape one article URL at a time leads to the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)

Here's my code in its simplest form:

from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    # headers must be passed by keyword; the second positional
    # argument to urlfetch.fetch() is the request payload
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content

Here is another variant I have tried, with the same result:

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

Here's the ugly, brittle one, which also doesn't work:

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content

        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except: 
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"

        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

The last variant returns s as the string "did not work" from the last except.

So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to fetch the entire HTML rather than soupify it right away with BeautifulSoup? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

2 Answers
太酷不给撩
#2 · 2019-07-24 20:34

It is better to simply read the Content-Type from the response headers or the meta tags. Note that Chrome (unlike Opera) does not guess the encoding: if neither place declares UTF-8 or anything else, it treats the site as using the Windows default encoding. So only really badly built sites fail to declare it.
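As a rough sketch of that, working against the urlfetch result object from the question (charset_of is a made-up helper name, and the header keys are matched case-insensitively just in case):

import re

def charset_of(result):
    # Response headers first, e.g. "text/html; charset=utf-8".
    for key, value in result.headers.items():
        if key.lower() == 'content-type':
            m = re.search(r'charset=([\w-]+)', value, re.I)
            if m:
                return m.group(1)
    # Then the markup: <meta charset="..."> or the http-equiv variant.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', result.content, re.I)
    if m:
        return m.group(1)
    return None  # nothing declared; see the next answer for detection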

叼着烟拽天下
#3 · 2019-07-24 20:52

I had the same problem some time ago, and nothing is 100% accurate. What I did was the following (a rough sketch follows the list):

  • Get the encoding from the Content-Type header
  • Get the encoding from the meta tags
  • Detect the encoding with the chardet Python module
  • Decode the text to Unicode with the most likely of those encodings
  • Process the text/html
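A minimal sketch of that pipeline, matching the question's Python 2 code; detect_encoding and to_unicode are made-up helper names, and the third-party chardet module is assumed to be installed:

import re
import chardet

def detect_encoding(content, content_type=''):
    # 1. Charset declared in the Content-Type response header.
    m = re.search(r'charset=([\w-]+)', content_type, re.I)
    if m:
        return m.group(1)
    # 2. Charset declared in a <meta> tag inside the HTML.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', content, re.I)
    if m:
        return m.group(1)
    # 3. Statistical guess from the raw bytes.
    guess = chardet.detect(content)
    if guess.get('encoding'):
        return guess['encoding']
    # 4. Last resort: the most common encoding on the web.
    return 'utf-8'

def to_unicode(content, content_type=''):
    encoding = detect_encoding(content, content_type)
    try:
        # 'replace' keeps one bad byte from aborting the whole page.
        return content.decode(encoding, 'replace')
    except LookupError:
        # The site declared an encoding Python does not know.
        return content.decode('utf-8', 'replace')

The declared encodings are tried before chardet because detection is slower and can misfire on short pages; the 'replace' error handler and the LookupError fallback keep a wrong or unrecognized declaration from killing the whole article. Once the text is Unicode, it can be soupified later without tripping the ASCII codec.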