How to deal with unknown encoding when scraping web pages

Published 2019-07-24 19:53


I'm scraping news articles from various sites, using GAE and Python.

The code where I scrape one article URL at a time leads to the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)

Here's my code in its simplest form:

from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    # headers must be passed by keyword; the second positional
    # argument to urlfetch.fetch() is the request payload
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content

Here is another variant I have tried, with the same result:

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

Here's the ugly, brittle one, which also doesn't work:

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content

        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except: 
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"

        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

The last variant returns s as the string "did not work" from the last except.

So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to fetch the entire HTML rather than soupify it right away with BeautifulSoup? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

2 Answers
太酷不给撩
#2 · 2019-07-24 20:34

It is better to simply read the Content-Type from the response headers or the meta tags. Note that Chrome (unlike Opera) does not guess the encoding: if neither place declares UTF-8 or anything else, it treats the site as using the Windows default encoding. So only really badly built sites fail to declare it.
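As a rough sketch of that, working against the urlfetch result object from the question (charset_of is a made-up helper name, and the header keys are matched case-insensitively just in case):

import re

def charset_of(result):
    # Response headers first, e.g. "text/html; charset=utf-8".
    for key, value in result.headers.items():
        if key.lower() == 'content-type':
            m = re.search(r'charset=([\w-]+)', value, re.I)
            if m:
                return m.group(1)
    # Then the markup: <meta charset="..."> or the http-equiv variant.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', result.content, re.I)
    if m:
        return m.group(1)
    return None  # nothing declared; see the next answer for detection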

叼着烟拽天下
#3 · 2019-07-24 20:52

I had the same problem some time ago, and nothing is 100% accurate. What I did was the following (a rough sketch follows the list):

  • Get the encoding from the Content-Type header
  • Get the encoding from the meta tags
  • Detect the encoding with the chardet Python module
  • Decode the text to Unicode with the most likely of those encodings
  • Process the text/html
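A minimal sketch of that pipeline, matching the question's Python 2 code; detect_encoding and to_unicode are made-up helper names, and the third-party chardet module is assumed to be installed:

import re
import chardet

def detect_encoding(content, content_type=''):
    # 1. Charset declared in the Content-Type response header.
    m = re.search(r'charset=([\w-]+)', content_type, re.I)
    if m:
        return m.group(1)
    # 2. Charset declared in a <meta> tag inside the HTML.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', content, re.I)
    if m:
        return m.group(1)
    # 3. Statistical guess from the raw bytes.
    guess = chardet.detect(content)
    if guess.get('encoding'):
        return guess['encoding']
    # 4. Last resort: the most common encoding on the web.
    return 'utf-8'

def to_unicode(content, content_type=''):
    encoding = detect_encoding(content, content_type)
    try:
        # 'replace' keeps one bad byte from aborting the whole page.
        return content.decode(encoding, 'replace')
    except LookupError:
        # The site declared an encoding Python does not know.
        return content.decode('utf-8', 'replace')

The declared encodings are tried before chardet because detection is slower and can misfire on short pages; the 'replace' error handler and the LookupError fallback keep a wrong or unrecognized declaration from killing the whole article. Once the text is Unicode, it can be soupified later without tripping the ASCII codec.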