When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages:
if 'var' in str(tag.string):
Here is the context:
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))
for tag in soup.findAll('script'):
    if 'var' in str(tag.string):  # This is the line throwing the exception
        print(tag.string)
Here is the exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)
I have tried both with and without the encode('utf-8') call in the BeautifulSoup line; it makes no difference. I do note that the pages throwing the exception contain a character Ã in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I do realise that I can remove the offending characters with unicodedata.normalize, however I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following methods help to change the variable to utf-8:
tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')
What must I do to this string in order to transform it into usable utf-8?
OK, so basically you're getting an HTTP response encoded in Latin-1. The character giving you problems is indeed Ã, because a Latin-1 code table shows that 0xC3 is exactly that character. I think you blindly tried every combination of decoding and encoding the request you could imagine. First of all, if you do this:
if 'var' in str(tag.string):
then whenever the string variable contains non-ASCII bytes, Python will complain. Looking at the code you've shared with us, the right approach IMHO would be to keep the text as unicode while you work with it and only encode it when you actually need bytes for output.
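A minimal sketch of that idea, assuming Python 2 with Requests and BS4 as in the question (the URL is a placeholder, not from the original post):

# -*- coding: utf-8 -*-
# Sketch only: assumes Python 2, Requests and bs4; the URL is a placeholder.
import requests
import bs4

url = 'http://example.com/'  # placeholder
response = requests.get(url)

# response.text is already unicode (Requests decodes the body using
# response.encoding), so hand it to Beautiful Soup as-is instead of re-encoding it.
soup = bs4.BeautifulSoup(response.text)

for tag in soup.findAll('script'):
    text = tag.string or u''         # tag.string is unicode (or None)
    if u'var' in text:               # unicode-to-unicode comparison, no str() needed
        print(text.encode('utf-8'))  # encode only at output time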
EDIT: It will be useful for you to take a look at the encoding section of the BeautifulSoup 4 docs.
Basically, the logic is:
- the bytes you receive are encoded in some encoding X (here, Latin-1);
- you decode them with bytes.decode('X'), which returns a unicode string;
- you choose the encoding Y you want for the output (here, UTF-8);
- you encode the unicode string for the output with ubytes.encode('Y').
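As a concrete illustration of that round trip (the byte string below is just an example containing the 0xC3 byte, not taken from the actual page):

# Example only: a Latin-1 byte string containing 0xC3 (the character Ã).
raw = '// a comment with \xc3'

# Decode from the source encoding X (Latin-1) to a unicode string.
text = raw.decode('ISO-8859-1')    # u'// a comment with Ã'

# Encode to the target encoding Y (UTF-8) for output.
utf8_bytes = text.encode('utf-8')  # '// a comment with \xc3\x83' (Ã is two bytes in UTF-8)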
Hope this sheds some light on the problem.
You can also try using the UnicodeDammit library (it is part of BS4) to parse pages. There is a detailed description here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
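A minimal sketch of how that might look, assuming response is the Requests response from the question:

import bs4
from bs4 import UnicodeDammit

# Let UnicodeDammit guess the encoding from the raw bytes of the response.
dammit = UnicodeDammit(response.content)
print(dammit.original_encoding)                  # the encoding it settled on, e.g. 'iso-8859-1'
soup = bs4.BeautifulSoup(dammit.unicode_markup)  # parse the already-decoded unicode document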