I'm using bs4 to process some text, but in some cases it converts non-breaking space characters to Â. As best I can tell, this is an encoding mismatch between UTF-8 and Latin-1 (or the reverse?).
Everything in my web app is UTF-8, Python3 is UTF-8, and I've confirmed the database is UTF-8.
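To check the mismatch theory, here is a minimal sketch using only the standard library. It assumes the affected character is a non-breaking space (U+00A0); the variable names are made up for illustration. Misreading the UTF-8 bytes as Latin-1 and then re-encoding them produces the same bytes that show up in the output further down.

```python
# A non-breaking space (U+00A0), which is what &nbsp; decodes to.
# Assumption: this is the character being mangled.
original = "\u00a0"

# Encoding to UTF-8 yields two bytes.
utf8_bytes = original.encode("utf-8")

# Misreading those bytes as Latin-1 turns each byte into its own character.
misread = utf8_bytes.decode("latin-1")

# Re-encoding the misread text as UTF-8 doubles up the encoding.
double_encoded = misread.encode("utf-8")

print(utf8_bytes)      # b'\xc2\xa0'
print(repr(misread))   # 'Â\xa0'
print(double_encoded)  # b'\xc3\x82\xc2\xa0'
```

So the direction appears to be UTF-8 bytes being decoded as Latin-1 (or a similar single-byte encoding), not the reverse.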
I've narrowed down the problem to this one line:
print("Before soup: " + text) # Before soup:
soup = BeautifulSoup(text, "html.parser")
#.... do stuff to soup, but all commented out for this testing.
soup = BeautifulSoup(soup.renderContents(), "html.parser") # <---- PROBLEM!
print(soup.renderContents()) # b'\xc3\x82\xc2\xa0'
print("After SOUP: " + str(soup)) # After SOUP: Â
How do I prevent renderContents() from changing the encoding? There is no documentation on this function!
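One thing that may be worth trying: renderContents() returns bytes, so the second BeautifulSoup call has to guess their encoding. The sketch below assumes that guess is the culprit. It either stays in str with decode_contents() or states the encoding explicitly with from_encoding. The sample text is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample text containing a non-breaking space.
text = "a\u00a0b"

soup = BeautifulSoup(text, "html.parser")

# Option 1: decode_contents() returns str, so the re-parse never has to
# guess an encoding.
as_text = soup.decode_contents()
reparsed = BeautifulSoup(as_text, "html.parser")

# Option 2: if bytes are needed, encode_contents() emits UTF-8 by default,
# and from_encoding tells the parser how to read those bytes back.
as_bytes = soup.encode_contents()
reparsed_from_bytes = BeautifulSoup(as_bytes, "html.parser", from_encoding="utf-8")

print(repr(str(reparsed)))
print(repr(str(reparsed_from_bytes)))
```

Both variants should print the text with the non-breaking space intact and no stray Â.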
Edit: After further reading of the docs, the formatter argument seems to be the key, but I still can't fix the problem:
print(soup.prettify(formatter="html")) # Â