I've been using the excellent bleach library for removing bad HTML.
I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:
<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>
Using bleach (with the style tag implicitly disallowed) leaves me with:
st1:*{behavior:url(#ieooui) }
That isn't helpful. Bleach seems to offer only two options:
- Escape tags;
- Remove the tags (but not their contents).
I'm looking for a third option - remove the tags and their contents.
Is there any way to use bleach or html5lib to completely remove the style tag and its contents?
The documentation for html5lib isn't really a great deal of help.
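
To pin down what I mean by the third option, here is a rough sketch of the behaviour using only the standard library's html.parser, not bleach or html5lib. The names TagAndContentStripper and strip_tags_with_content are made up for illustration. It is not a sanitizer: it does no whitelisting or escaping, and it silently drops comments and doctype declarations, so it would have to run alongside bleach rather than replace it.

```python
from html.parser import HTMLParser


class TagAndContentStripper(HTMLParser):
    """Hypothetical helper: drop the listed tags and everything inside them."""

    def __init__(self, drop_tags=("style",)):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop_tags = {t.lower() for t in drop_tags}
        self.depth = 0   # greater than 0 while inside a dropped element
        self.parts = []  # pieces of markup that are kept

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <STYLE> arrives as "style".
        if tag in self.drop_tags:
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept unless they are dropped.
        if self.depth == 0 and tag not in self.drop_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop_tags:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_tags_with_content(html, drop_tags=("style",)):
    """Return html with every drop_tags element removed, contents included."""
    parser = TagAndContentStripper(drop_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_tags_with_content on the Word snippet above should give an empty string, and markup around the style block should pass through untouched. What I'd like is the same result from bleach or html5lib, so that the whitelist and the removal happen in one pass.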