So I am busy creating an XSLT file to process various XML documents into a new node layout.
There's one thing I can't figure out, here is an example of XML that I'm working with:
<page>
This is a paragraph on the page.
<newParagraph/>
This is another paragraph.
<newParagraph/>
Here is yet another paragraph on this page.
</page>
As you can see, the paragraphs are split up using empty tags as dividers. In the result XML I want this:
<page>
<p>
This is a paragraph on the page.
</p>
<p>
This is another paragraph.
</p>
<p>
Here is yet another paragraph on this page.
</p>
</page>
How can I achieve this using XSLT (Version 1.0 only)?
This is more or less a duplicate of another question, so the same approach will work:
<xsl:template match="pages">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="page/text()">
<p><xsl:value-of select="."/></p>
</xsl:template>
<!-- element names are case-sensitive: the input uses newParagraph -->
<xsl:template match="newParagraph" />
Simple and clean. Hope it helps
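For anyone who wants to run the idea above as-is, here is a minimal sketch of it as a complete stylesheet. It rests on a few assumptions of mine that are not in the answer: that the `<page>` wrapper should be kept in the output, that whitespace-only text nodes should not produce empty `<p>` elements, and that trimming and collapsing whitespace via normalize-space() is acceptable. It only covers pages whose paragraphs are plain text; inline elements would end up outside the `<p>` elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Assumption: keep the page element and process its children -->
  <xsl:template match="page">
    <page>
      <xsl:apply-templates/>
    </page>
  </xsl:template>

  <!-- Wrap every non-blank text run that sits directly in a page -->
  <xsl:template match="page/text()[normalize-space()]">
    <p><xsl:value-of select="normalize-space()"/></p>
  </xsl:template>

  <!-- Drop whitespace-only text runs so they do not become empty paragraphs -->
  <xsl:template match="page/text()[not(normalize-space())]"/>

  <!-- The dividers themselves produce no output -->
  <xsl:template match="newParagraph"/>
</xsl:stylesheet>
```

If the original line breaks inside a paragraph matter, `select="."` can be used instead of `select="normalize-space()"` in the second template.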
The following answer is not as elegant as @stwissel's, but it correctly handles any tag subtrees inside the paragraphs. It did become a little nasty, indeed. :-)
The problem with this task is that it requires special handling of what lies between a closing tag and the following matching opening tag (e.g. </tag><tag>). XSLT, however, is optimized for handling what lies between an opening tag and its matching closing tag (e.g. <tag></tag>). By the way: there's a way to "cheat" a little bit. See my other answer to this question.
Suppose you have an input XML as follows:
<pages>
<page>
This is a paragraph on the page.
<B>bold</B>
After Bold
<newParagraph/>
This is another paragraph.
<newParagraph/>
Here is yet another paragraph on this page.
<EM>
<B>
Bold and emphasized.
</B>
</EM>
After bold and emphasized.
</page>
<page>
Another page.
</page>
</pages>
It can be processed using this XSLT 1.0 transformation:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />
<xsl:template match="page">
<page>
<!-- handle the first paragraph up to the first newParagraph -->
<P>
<xsl:apply-templates select="node()[not(preceding-sibling::newParagraph)]" />
</P>
<!-- now handle all remaining paragraphs of the page -->
<xsl:for-each select="newParagraph">
<xsl:variable name="pCount" select="position()"/>
<P>
<!-- "<" must be escaped as &lt; inside an attribute value -->
<xsl:apply-templates select="following-sibling::node()[count(preceding-sibling::newParagraph) &lt;= $pCount]" />
</P>
</xsl:for-each>
</page>
</xsl:template>
<!-- this default rule recursively copies all substructures within a paragraph at tag level -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- this default rule makes sure that texts between the tags are printed -->
<xsl:template match="text()">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="newParagraph"/>
</xsl:stylesheet>
producing this output
<pages>
<page><P>
This is a paragraph on the page.
<B>bold</B>
After Bold
</P><P>
This is another paragraph.
</P><P>
Here is yet another paragraph on this page.
<EM>
<B>
Bold and emphasized.
</B>
</EM>
After bold and emphasized.
</P></page>
<page><P>
Another page.
</P></page>
</pages>
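The same grouping can also be expressed with xsl:key instead of counting preceding siblings. This is only a sketch of an alternative, not part of the answer above: the key name paraNodes and the variable pageId are names I made up for illustration. Each non-divider child of a page is indexed by its page and its nearest preceding newParagraph, which may scale better on long pages in processors that index keys.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical key: index every non-divider child of a page by the id of
       its page plus the id of the nearest preceding divider. generate-id()
       of an empty node-set is the empty string, so the nodes before the
       first divider are filed under the page id followed by '|'. -->
  <xsl:key name="paraNodes"
           match="page/node()[not(self::newParagraph)]"
           use="concat(generate-id(..), '|',
                       generate-id(preceding-sibling::newParagraph[1]))"/>

  <xsl:template match="page">
    <xsl:variable name="pageId" select="generate-id()"/>
    <page>
      <!-- first paragraph: everything before the first divider -->
      <P>
        <xsl:apply-templates select="key('paraNodes', concat($pageId, '|'))"/>
      </P>
      <!-- one further paragraph per divider -->
      <xsl:for-each select="newParagraph">
        <P>
          <xsl:apply-templates
              select="key('paraNodes', concat($pageId, '|', generate-id()))"/>
        </P>
      </xsl:for-each>
    </page>
  </xsl:template>

  <!-- identity rule: copies subtrees such as B and EM unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because the dividers are excluded from the key, no empty template for newParagraph is needed here. Including the page id in the key value keeps the paragraphs of different pages apart.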
If you are willing to "cheat" a little bit, you can manually insert XML tags into the result document that are not part of the node tree but are emitted as plain text. A downstream processor will not notice the difference, provided that it re-parses the serialized output. Note that disable-output-escaping is an optional feature in XSLT 1.0, so a processor may ignore it, and it has no effect when the result tree is handed on without being serialized.
Given the input of my other answer, the following XSLT 1.0 transformation will do the trick (preserving the subtrees in the paragraphs):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />
<xsl:template match="page">
<page>
<P>
<xsl:apply-templates/>
</P>
</page>
</xsl:template>
<!-- this default rule recursively copies all substructures within a paragraph at tag level -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- this default rule makes sure that texts between the tags are printed -->
<xsl:template match="text()">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="newParagraph">
<!-- This inserts a matching closing and opening tag -->
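<!-- the markup must be escaped in the stylesheet itself; it is written unescaped on output -->
<xsl:value-of select="'&lt;/P&gt;&lt;P&gt;'" disable-output-escaping="yes" />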
</xsl:template>
</xsl:stylesheet>