I'd like to use Python to scrape a website that is full of horrible problems, one being the wrong encoding declared at the top:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This is wrong because the page is full of occurrences like the following:
Nell’ambito
instead of
Nell'ambito
(please notice ’
replaces '
)
If I understand correctly, this is happening because UTF-8 bytes (probably the database encoding) are being interpreted as ISO-8859-1 bytes (forced by the charset in the meta tag). I found an initial explanation at http://www.i18nqa.com/debug/utf8-debug.html
I am using BeautifulSoup to navigate the page and Google App Engine's urlfetch to make requests. However, all I need to understand is the correct way to fix the string before storing it in my database, so that ’ is converted back to '.
Are you feeding the encoding from the
Content-Type
HTTP header into BeautifulSoup?
If an HTML page has both a Content-Type header and a meta tag, the header should ‘win’, so if you're only taking the meta tag you may get the wrong encoding.
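As a rough sketch of the header-first approach (Python 3, standard library only): the helper names `charset_from_content_type` and `decode_body` are made up for illustration, and the UTF-8 fallback is an assumption about this particular site, not a general rule.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(raw_bytes, header_value, fallback="utf-8"):
    """Decode a response body using the header's charset, else the fallback."""
    charset = charset_from_content_type(header_value) or fallback
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that don't fit it: decode leniently.
        return raw_bytes.decode(fallback, errors="replace")
```

With urlfetch you would presumably pass the response's `content` and its `Content-Type` header value to `decode_body`. Alternatively, hand the raw bytes to BeautifulSoup together with the detected charset (the `from_encoding` keyword in bs4, `fromEncoding` in BeautifulSoup 3) so that it doesn't trust the meta tag.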
Otherwise, you could either feed the fixed encoding
'utf-8'
into BeautifulSoup, or fix up each string individually.
Annoying note: it's not actually ISO-8859-1. When web pages say ISO-8859-1, browsers actually take it to mean Windows code page 1252, which is similar to 8859-1 but not the same. The
€
would seem to indicate cp1252 because it's not present in 8859-1.
If the content is encoded inconsistently, with some UTF-8 and some cp1252 on the same page (typically due to poor database content handling), fixing up each string individually would be the only way to recover it, catching
UnicodeError
and returning the original string when it wouldn't transcode.
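A minimal sketch of that per-string fix, assuming Python 3 and that the text has already been decoded (wrongly) as cp1252; the name `fix_mojibake` is made up for illustration. Note that the repaired character is the typographic apostrophe ’ (U+2019), which is what the page's bytes actually encode, not the ASCII '.

```python
def fix_mojibake(text):
    """Undo UTF-8 that was mis-decoded as cp1252; return text unchanged on failure."""
    try:
        # Recover the original bytes, then decode them as what they really are.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Already-correct text (or anything else that won't transcode) is kept as-is.
        return text


broken = "Nell\u00e2\u20ac\u2122ambito"  # displays as Nellâ€™ambito
fixed = fix_mojibake(broken)             # displays as Nell’ambito
```

Encoding with `'iso-8859-1'` instead of `'cp1252'` would raise on this sample, because € has no code point in 8859-1. If you do want a plain ASCII apostrophe in the database, something like `fixed.replace("\u2019", "'")` afterwards would do it.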