Unwanted replacement of HTML entities by BeautifulSoup

Posted 2019-08-02 04:00

I have some HTML containing MathML that I am generating from Word documents using MathType. I have a Python script that uses BeautifulSoup to prettify it, but the problem is that it takes an entity like &ang; and turns it into the actual byte sequence 0xE2 0x88 0xA0, which is the UTF-8 encoding of the ∠ symbol. This is a problem because 0xE2 0x88 0xA0 won't display as ∠ in the browser. Instead the browser interprets it as a series of Latin characters. This is happening with all the math entities as well, such as &Delta;, &ang;, &minus;, &plus;, and so on.
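To make the symptom concrete, here is a minimal sketch using only the Python standard library. It assumes the browser is decoding the page as Windows-1252 (for example because no UTF-8 charset is declared); that encoding is a guess for illustration, not something stated in the question.

```python
# The character that the &ang; entity stands for.
angle = "\u2220"

# Encoding it as UTF-8 yields the three bytes mentioned above.
utf8_bytes = angle.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 88 a0

# Assumption: the browser reads those bytes as Windows-1252.
# Each byte then becomes its own character instead of one symbol.
misread = utf8_bytes.decode("cp1252")
print(repr(misread))

# Decoding with the right encoding recovers the original symbol.
assert utf8_bytes.decode("utf-8") == angle
```

The bytes themselves are a valid encoding of ∠; the garbage only appears when they are decoded with the wrong character set.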

I looked through the BeautifulSoup documentation and I can see how to turn entities into the byte sequences, but I'm not using that command; all I'm using is prettify(). And I didn't see a way in the BeautifulSoup documentation to not turn entities into byte sequences.

Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities to byte sequences? I hope so because it seems kind of dumb to have to undo the damage after prettify runs :)

Thanks in advance for your help!

1 answer
Summer. ? 凉城
Reply #2 · 2019-08-02 04:12

I missed part of the BeautifulSoup documentation. The default output formatter does exactly what I described: it turns HTML entities into the corresponding Unicode characters. This behaviour can be changed by using a different output formatter. (D'oh)

"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."

So if I pass formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you, Beautiful Soup!
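For anyone landing here later, a minimal sketch of the difference, assuming the bs4 package is installed and using the built-in "html.parser". The sample markup is made up for illustration and is not from my real documents.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input containing named math entities.
html = "<p>&ang;ABC = &Delta;x &minus; 1</p>"
soup = BeautifulSoup(html, "html.parser")

# Default formatter: the entities come back out as Unicode characters.
default_output = soup.prettify()

# formatter="html": Unicode characters are written as named HTML
# entities wherever Beautiful Soup knows a name for them.
entity_output = soup.prettify(formatter="html")

print(default_output)
print(entity_output)
```

The same formatter argument is accepted by encode() and decode(), as the documentation quote above says. Characters that have no named entity are left as they are, so this is not a guarantee of pure ASCII output for arbitrary input.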

(And they have such great documentation. Pity I didn't read the whole thing sooner. :$)
