When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages:
if 'var' in str(tag.string):
Here is the context:
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))
for tag in soup.findAll('script'):
    if 'var' in str(tag.string): # This is the line throwing the exception
        print(tag.string)
Here is the exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)
I have tried both with and without the encode('utf-8') call in the BeautifulSoup line; it makes no difference. I do note that for the pages throwing the exception there is a character Ã in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I do realise that I can remove the offending characters with unicodedata.normalize, however I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following methods help to change the variable to utf-8:
tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')
What must I do to this string in order to transform it into usable utf-8?
OK, so basically you're getting an HTTP response encoded in Latin-1. The character giving you problems is indeed Ã: in a Latin-1 code table, 0xC3 is exactly that character.
I think you blindly tested every combination of decoding/encoding you could imagine on the response. First of all, if you do this: if 'var' in str(tag.string): then whenever tag.string contains non-ASCII bytes, Python will complain.
Looking at the code you've shared with us, the right approach IMHO would be:
response = requests.get(url)
# decode the latin-1 bytes to unicode
#soup = bs4.BeautifulSoup(response.content.decode('latin-1'))
#try this line instead; pass the raw bytes (response.content), because
#from_encoding is ignored when the markup is already unicode (response.text)
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)
for tag in soup.findAll('script'):
    # since soup was now made with unicode strings, I suppose you can treat
    # its elements as such; tag.string is None for scripts with no inline text
    if tag.string and u'var' in tag.string: # This line used to throw the exception
        # now if you want output in utf-8
        print(tag.string.encode('utf-8'))
EDIT: It will be useful for you to take a look at the encoding section of the Beautiful Soup 4 documentation.
Basically, the logic is:
- You get some bytes encoded in encoding X
- You decode them by doing bytes.decode('X'), and this returns a unicode string
- You work with unicode
- You encode the unicode to some encoding Y for the output: ubytes.encode('Y')
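The steps above can be sketched without any network access. This is only an illustration under stated assumptions: the sample bytes and the variable names (raw, text, has_var, out) are made up here, with Latin-1 standing in for X and UTF-8 for Y.

```python
# Hypothetical sample: Latin-1 bytes such as a server might send.
# The final byte 0xC3 is the character Ã in Latin-1.
raw = b"var x = 1; // \xc3"

# Step 2: decode the bytes from encoding X (Latin-1 here) into unicode.
text = raw.decode("latin-1")

# Step 3: work with unicode, comparing against a unicode literal.
has_var = u"var" in text

# Step 4: encode to encoding Y (UTF-8 here) only when producing output.
out = text.encode("utf-8")

print(has_var)
print(repr(out))
```

Note that the single Latin-1 byte becomes two bytes in UTF-8, which is why mixing up the two encodings corrupts the text rather than just failing.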
Hope this sheds some light on the problem.
You can also try using the Unicode, Dammit library (it is part of BS4) to parse the pages. Detailed description here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
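As a rough sketch of that suggestion, UnicodeDammit can be used on its own to turn raw bytes into unicode. The sample bytes below are made up, and passing "latin-1" as a candidate encoding is an assumption that fits the page described in the question; for a real page you would pass response.content instead.

```python
from bs4 import UnicodeDammit

# Hypothetical sample bytes; in practice this would be response.content.
raw = b"var x = 1; // \xc3"

# The second argument lists encodings to try first (an assumption here).
dammit = UnicodeDammit(raw, ["latin-1"])

# unicode_markup holds the decoded text, original_encoding the codec used.
text = dammit.unicode_markup
print(dammit.original_encoding)
print(repr(text))
```

The resulting unicode string can then be searched and encoded to utf-8 in the same way as tag.string.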