I have some html containing mml that I am generating from Word documents using MathType. I have a python script that uses BeautifulSoup to prettify it, but the problem is it takes something like ∠
and turns it into the actual byte sequence 0xE2 0x88 0xA0
which is the ∠ symbol. This is a problem because 0xE2 0x88 0xA0
won't display as ∠ in the browser. Instead the browser interprets it as a series of latin characters. This is happening with all the math entities as well, such as Δ ∠ − +... etc.
I looked through the BeautifulSoup documentation and I can see how to turn entities into the byte sequences, but I'm not using that command; all I'm using is prettify(). And I didn't see a way in the BeautifulSoup documentation to not turn entities into byte sequences.
Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities to byte sequences? I hope so because it seems kind of dumb to have to undo the damage after prettify runs :)
Thanks in advance for your help!
I missed part of the BeautifulSoup documentation. The default output formatters do the described behaviour: they turn html entities into the unicode characters. So, this behaviour can be changed by using a different output formatter. (D'oh)
"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."
So if I pass in the
formatter="html"
Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you Beautiful Soup!(And they have such great documentation. Pity I didn't read the whole thing sooner. :$)