I need to parse real-world HTML documents. In most cases they are well formed, but sometimes (and this cannot be ignored) they arrive as fragments with more than one sibling element at the root level.
Example:
<div>one</div>
<div>two</div>
I currently use libxml2 v2.7.8 with the following parse flags:
HTML_PARSE_NOERROR | HTML_PARSE_RECOVER | HTML_PARSE_NODEFDTD | HTML_PARSE_NOIMPLIED
If I feed it the example above and then dump the HTML from the parsed document, I get:
<div>one<div>two</div></div>
As you can see, it nests the second element inside the first, whereas my requirement is that parsing must not alter the structure of the HTML. I'd also like to be able to run XPath expressions on trees created from such fragments. In this case, one would reach the second DIV with '/div[2]'.
So my question is: is it possible to parse this kind of HTML fragment with libxml2, and if so, how?