Scraping a website whose encoding is iso-8859-1 in

Using Python, I'd like to scrape a website that is full of horrible problems, one being the wrong encoding declared at the top:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

This is wrong because the page is full of occurrences like the following:

Nellâ€™ambito

instead of

Nell'ambito (please notice â€™ replaces ')

If I understand correctly, this is happening because utf-8 bytes (probably the database encoding) are interpreted as iso-8859-1 bytes (forced by the charset in the meta tag). I found some initial explanation at this link http://www.i18nqa.com/debug/utf8-debug.html

I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests; however, all I need is to understand the correct way to store the string in my database so that â€™ is converted back to '.

Tags: python unicode utf-8 beautifulsoup

1 answer

成全新的幸福

#2 · 2019-08-09 01:01

I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests

Are you feeding the encoding from the Content-Type HTTP header into BeautifulSoup?

If an HTML page has both a Content-Type header and a meta tag, the header should ‘win’, so if you're only taking the meta tag you may get the wrong encoding.
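A minimal sketch of that precedence rule, using only the standard library. The function names `charset_from_content_type` and `pick_encoding` are hypothetical, and the 'utf-8' fallback is an assumption for illustration, not something the page or urlfetch guarantees:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()


def pick_encoding(header_value, meta_charset, fallback='utf-8'):
    """The HTTP header wins over the meta tag; the fallback is an assumed default."""
    return charset_from_content_type(header_value) or meta_charset or fallback


# Header says UTF-8 while the meta tag says iso-8859-1: the header wins.
print(pick_encoding('text/html; charset=UTF-8', 'iso-8859-1'))
# No charset in the header: fall back to what the meta tag declares.
print(pick_encoding('text/html', 'iso-8859-1'))
```

The value this returns is what you would then pass to the parser as the encoding, instead of letting it trust the meta tag.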

Otherwise, you could either feed the fixed encoding 'utf-8' into BeautifulSoup, or fix up each string individually.

Annoying note: it's not actually ISO-8859-1. When web pages say ISO-8859-1, browsers actually take it to mean Windows code page 1252, which is similar to 8859-1 but not the same. The € would seem to indicate cp1252 because it's not present in 8859-1.
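The difference can be checked directly in Python 3. This is only an illustrative sketch; the variable names are arbitrary:

```python
# UTF-8 bytes of the right single quotation mark (U+2019).
raw = '\u2019'.encode('utf-8')

# Read as cp1252, the three bytes become the three visible characters â € ™.
as_cp1252 = raw.decode('cp1252')

# Read as real ISO-8859-1, the last two bytes become invisible C1 control
# characters (U+0080 and U+0099) rather than € and ™.
as_latin1 = raw.decode('iso-8859-1')

print(repr(raw))
print(repr(as_cp1252))
print(repr(as_latin1))

# The euro sign exists in cp1252 but cannot be encoded in ISO-8859-1.
try:
    '\u20ac'.encode('iso-8859-1')
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print(euro_in_latin1)
```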

u'Nellâ€™ambito'.encode('cp1252').decode('utf-8')

If the content is encoded inconsistently, with some UTF-8 and some cp1252 on the same page (typically due to poor database content handling), fixing each string individually would be the only way to recover it: catch UnicodeError and return the original string when it won't transcode.
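A minimal sketch of that per-string fallback, written for Python 3. The helper name `fix_mojibake` is hypothetical:

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-cp1252 damage; return text unchanged if it won't transcode."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except UnicodeError:
        # Either the text contains characters outside cp1252, or the
        # resulting bytes are not valid UTF-8. Assume the string was
        # decoded correctly in the first place and leave it alone.
        return text


print(fix_mojibake('Nell\u00e2\u20ac\u2122ambito'))  # damaged input
print(fix_mojibake('citt\u00e0'))                    # already correct input
```

Note that Python's cp1252 codec leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so damaged text whose original UTF-8 contained one of those bytes may not round-trip this way and would be returned unchanged.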
