I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes.
For example, I want to take the following fragment:
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
and turn it into this:
<section>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</section>
<section>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</section>
Is this possible using simple XPath selectors and an XML parser like Nokogiri? Or do I need to implement a SAX parser for this task?
Updated Answer
Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:
Here is this code in use, and the result it gives.
The original HTML
The conversion code
The result
Original Answer
Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
For XPath, see XPath : select all following siblings until another sibling
XPath can only select things from your input document; it can't transform it into a new document. For that you need XSLT or some other transformation language. I guess if you're into Nokogiri then the previous answers will be useful, but for completeness, here's what it looks like in XSLT 2.0:
One way using XPath is to select all the p elements that follow your h2 and from them subtract the p elements that also follow the next h2: