C/C++, libxml2: parsing HTML fragments

Posted 2019-07-19 01:26

Question:

I need to parse real-life HTML documents. In most cases they are well formed, but sometimes (and this cannot be ignored) they arrive as fragments with more than one sibling element at the root level.
Example:

<div>one</div>
<div>two</div>

I currently use libxml2 v2.7.8 with the following parse flags:

HTML_PARSE_NOERROR | HTML_PARSE_RECOVER | HTML_PARSE_NODEFDTD | HTML_PARSE_NOIMPLIED

If I feed it the example above and then dump the HTML from the parsed document, I get:

<div>one<div>two</div></div>

As you can see, it nests the elements, while my requirement is that the HTML must not be altered. I would also like to be able to run XPath expressions on trees created from such fragments. In this case, one would use '/div[2]' to get to the second DIV.

So the question is: is it possible to parse this kind of HTML, and if so, how?

Answer 1:

I guess you need an HTML to XML conversion. In Java I use JSoup, but Stack Overflow surely knows how to do it in C. First hit: HTML to XML conversion with C++