I'm using bs4 to process some text, but in some cases it converts non-breaking space (&nbsp;) characters to Â. As best I can tell, this is an encoding mismatch between UTF-8 and latin1 (or the reverse?).
Everything in my web app is UTF-8, Python3 is UTF-8, and I've confirmed the database is UTF-8.
I've narrowed down the problem to this one line:
print("Before soup: " + text) # Before soup:
soup = BeautifulSoup(text, "html.parser")
#.... do stuff to soup, but all commented out for this testing.
soup = BeautifulSoup(soup.renderContents(), "html.parser") # <---- PROBLEM!
print(soup.renderContents()) # b'\xc3\x82\xc2\xa0'
print("After SOUP: " + str(soup)) # After SOUP: Â
How do I prevent renderContents() from changing the encoding? There is no documentation on this function!
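For reference, the bytes printed above can be reproduced with the standard library alone, which suggests the direction of the mismatch: the UTF-8 bytes for a non-breaking space are read as latin1 and then written out as UTF-8 again. This is only a sketch of the mechanism, not a trace of what bs4 does internally, and the variable names are illustrative.

```python
# A non-breaking space, as it would appear in the original text.
nbsp = "\u00a0"

# Correct UTF-8 encoding of that character: two bytes.
utf8_bytes = nbsp.encode("utf-8")

# If those two bytes are wrongly decoded as latin1, each byte becomes
# its own character: U+00C2 (Â) followed by U+00A0 (non-breaking space).
misread = utf8_bytes.decode("latin-1")

# Encoding the misread string as UTF-8 again gives the doubled-up bytes.
double_encoded = misread.encode("utf-8")

print(utf8_bytes)       # b'\xc2\xa0'
print(double_encoded)   # b'\xc3\x82\xc2\xa0'

# Running the chain in reverse recovers the original character.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

The second printed value matches the output of renderContents() in the snippet above, so the stray Â is most likely the first byte of the UTF-8 non-breaking space being treated as a character of its own.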
Edit: Upon further research into the docs, this seems to be the key, but I still can't fix the problem!
print(soup.prettify(formatter="html")) # Â
OK, apparently I hadn't read deeply enough into the docs. Here's where the answer can be found:
From https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings:
The problem is that the snippet of code provided to BS is so short that BeautifulSoup's sub-library, Unicode, Dammit, doesn't have enough info to properly guess the encoding. So the key is to add from_encoding="UTF-8" each time the BS is constructed from bytes:
soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
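A minimal sketch of the fix, assuming bs4 is installed and using a made-up sample string (the variable names are illustrative too). It shows the explicit from_encoding approach next to an alternative that is not in the original post: keeping the markup as str with decode_contents(), so that no encoding has to be guessed at all. Whether the unfixed version misbehaves on a given machine may depend on which optional detection libraries are installed, so the sketch only shows the two variants that should be safe.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input containing a non-breaking space (U+00A0).
text = "a\u00a0b"

soup = BeautifulSoup(text, "html.parser")

# Option 1: round-trip through bytes, but state the encoding explicitly
# so the encoding detector does not have to guess from a tiny input.
raw = soup.encode_contents()  # bytes, UTF-8 by default
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# Option 2: stay in str the whole time. decode_contents() returns a
# Unicode string, so there is no byte decoding step to get wrong.
soup_from_str = BeautifulSoup(soup.decode_contents(), "html.parser")

print(repr(str(soup_from_bytes)))
print(repr(str(soup_from_str)))
```

If the second soup is only needed as a fresh copy of the first, Option 2 is probably the simpler choice, since from_encoding only matters when the input is bytes.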