I need to tidy up HTML pages and convert them to XML in Python; losing some "bad" parts along the way is acceptable.
I used TagSoup for some time, but it doesn't understand the newer "article" and "footer" tags, and it doesn't like "meta" tags that are not in the head, which makes the resulting XML almost impossible to process.
I like what html5lib does so far, but my fifth test (the tests are deliberately weird) failed. When parsing
<div attr="val"">
using html5lib + xml.dom treebuilder, I got the following in the resulting XML string:
<div attr="val" "="">
which is not well-formed XML.
When I tried html5lib with lxml as the treebuilder, that was converted to
<div attr="val" U00022="">
which is better, but the problem is that lxml "eats" the closing tags/slashes of <link> tags, leaving just <link ... > when outputting XML.
What would you recommend using?
You can use
to set an Element to be self-closing or not, something like this:
Then just do whatever you want with it. When you write out the Element, you can also add
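
To illustrate the self-closing versus explicit-end-tag distinction, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml. That substitution is an assumption made for the example, and lxml's own API for this differs. The `link` element and its attributes are hypothetical stand-ins for what the HTML parser would produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical element standing in for a <link> produced by the HTML parser.
link = ET.Element("link", {"rel": "stylesheet", "href": "style.css"})

# By default, an element with no text and no children is written self-closed.
self_closed = ET.tostring(link, encoding="unicode")

# short_empty_elements=False forces an explicit end tag instead.
expanded = ET.tostring(link, encoding="unicode", short_empty_elements=False)

print(self_closed)
print(expanded)
```

Both strings are well-formed XML, so either form can be fed to a downstream XML parser; which one you want depends on what consumes the output.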