XSLT: Reading content that is divided by empty tags

Posted 2019-04-15 13:09

Question:

So I am busy creating an XSLT file to process various XML documents into a new node layout.

There's one thing I can't figure out, here is an example of XML that I'm working with:

<page>
   This is a paragraph on the page.
    <newParagraph/>
   This is another paragraph.
    <newParagraph/>
   Here is yet another paragraph on this page.
</page>

As you can see, the paragraphs are split up using empty tags as dividers. In the result XML I want this:

<page>
   <p>
    This is a paragraph on the page.
   </p>
   <p> 
    This is another paragraph.
   </p>
   <p>
   Here is yet another paragraph on this page.
   </p>
</page>

How can I achieve this using XSLT (Version 1.0 only)?

Answer 1:

This is more or less a duplicate of another question, so the same approach will work:

<xsl:template match="pages">
    <xsl:apply-templates />
</xsl:template>

<xsl:template match="page/text()">
    <p><xsl:value-of select="."/></p>
</xsl:template>

<!-- element names are case-sensitive: the input uses newParagraph -->
<xsl:template match="newParagraph" />

Simple and clean. Hope it helps
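
For reference, here is a minimal sketch of how the templates above might look as a complete stylesheet. It assumes `page` is the element that holds the text, as in the question. The `normalize-space()` predicates are my own addition and not part of the original answer: they trim each paragraph and keep whitespace-only text nodes from turning into empty `<p>` elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Rebuild each page; its children are handled by the templates below -->
  <xsl:template match="page">
    <page>
      <xsl:apply-templates/>
    </page>
  </xsl:template>

  <!-- Each non-blank text node directly inside page becomes one paragraph
       (assumption: surrounding whitespace is not significant) -->
  <xsl:template match="page/text()[normalize-space()]">
    <p><xsl:value-of select="normalize-space()"/></p>
  </xsl:template>

  <!-- Whitespace-only text nodes and the dividers themselves produce nothing -->
  <xsl:template match="page/text()[not(normalize-space())]"/>
  <xsl:template match="newParagraph"/>
</xsl:stylesheet>
```

Keep in mind that this only works when a paragraph is a single text node. If a paragraph contains inline elements, the text before and after each element ends up in separate `<p>` elements.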



Answer 2:

The following answer is not as elegant as @stwissel's, but it will correctly preserve any tag subtrees in the paragraphs. It did become a little nasty, indeed. :-)

The problem with this task is that it requires special handling of what lies between a closing tag and the following matching opening tag (e.g. </tag><tag>). XSLT, however, is optimized for handling what lies between an opening tag and its matching closing tag (e.g. <tag></tag>). By the way: there's a way to "cheat" a little bit. See my other answer to this question.

Suppose you have an input XML as follows:

<pages>
  <page>
    This is a paragraph on the page.
    <B>bold</B>
    After Bold
    <newParagraph/>
    This is another paragraph.
    <newParagraph/>
    Here is yet another paragraph on this page.
    <EM>
      <B>
        Bold and emphasized.
      </B>
    </EM>
    After bold and emphasized.
  </page>
  <page>
    Another page.
  </page>
</pages>

It can be processed using this XSLT 1.0 transformation:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />

  <xsl:template match="page">
    <page>
      <!-- handle the first paragraph up to the first newParagraph -->
      <P>
        <xsl:apply-templates select="node()[not(preceding-sibling::newParagraph)]" />
      </P>

      <!-- now handle all remaining paragraphs of the page -->
      <xsl:for-each select="newParagraph">
        <xsl:variable name="pCount" select="position()"/>
        <P>
          <xsl:apply-templates select="following-sibling::node()[count(preceding-sibling::newParagraph) &lt;= $pCount]" />
        </P>
      </xsl:for-each>
    </page>
  </xsl:template>

  <!-- this default rule recursively copies all substructures within a paragraph at tag level -->  
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>


  <!-- this default rule makes sure that texts between the tags are printed -->
  <xsl:template match="text()">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="newParagraph"/>

</xsl:stylesheet>

producing this output

<pages>
  <page><P>
    This is a paragraph on the page.
    <B>bold</B>
    After Bold
    </P><P>
    This is another paragraph.
    </P><P>
    Here is yet another paragraph on this page.
    <EM>
      <B>
        Bold and emphasized.
      </B>
    </EM>
    After bold and emphasized.
  </P></page>
  <page><P>
    Another page.
  </P></page>
</pages>
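
The `count(preceding-sibling::newParagraph)` predicate above is re-evaluated for every sibling, which can get slow on long pages. As a possible alternative (a sketch only, not part of the original answer), the same grouping can be expressed with `xsl:key`, indexing each child of `page` by the generated id of its nearest preceding divider. The key name `byDivider` is just an illustrative choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index every non-divider child of a page by the id of the nearest
       preceding newParagraph sibling -->
  <xsl:key name="byDivider"
           match="page/node()[not(self::newParagraph)]"
           use="generate-id(preceding-sibling::newParagraph[1])"/>

  <xsl:template match="page">
    <page>
      <!-- first paragraph: children of this page before any divider -->
      <P>
        <xsl:apply-templates
            select="node()[not(self::newParagraph)][not(preceding-sibling::newParagraph)]"/>
      </P>
      <!-- one further paragraph per divider, looked up through the key -->
      <xsl:for-each select="newParagraph">
        <P>
          <xsl:apply-templates select="key('byDivider', generate-id())"/>
        </P>
      </xsl:for-each>
    </page>
  </xsl:template>

  <!-- identity template: copies subtrees inside a paragraph unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The first paragraph deliberately does not go through the key. Children that have no preceding divider all share the empty string as key value, across every page, so a plain sibling test is the safer choice there.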


Answer 3:

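If you are willing to "cheat" a little bit, you can manually insert XML tags into the result document as plain text rather than as nodes of the result tree. A downstream processor will not notice the difference, provided that it re-parses the serialized output. Note that this relies on disable-output-escaping, which XSLT 1.0 processors are not required to support and which only takes effect when the result is serialized.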

Given the input of my other answer the following XSLT 1.0 transformation will do the trick (preserving the sub trees in the paragraphs):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />

  <xsl:template match="page">
    <page>
      <P>
        <xsl:apply-templates/>
      </P>
    </page>
  </xsl:template>

  <!-- this default rule recursively copies all substructures within a paragraph at tag level -->  
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>


  <!-- this default rule makes sure that texts between the tags are printed -->
  <xsl:template match="text()">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="newParagraph">
    <!-- This inserts a matching closing and opening tag -->
    <xsl:value-of select="'&lt;/P&gt;&lt;P&gt;'" disable-output-escaping="yes" />
  </xsl:template>

</xsl:stylesheet>