How to split text and preserve HTML tags (XSLT 2.0)

Posted 2019-05-07 19:25

I have an xml that has a description node:

<config>
  <desc>A <b>first</b> sentence here. The second sentence with some link <a href="myurl">The link</a>. The <u>third</u> one.</desc>
</config>

I am trying to split the description into sentences, using the dot as a separator, while keeping any HTML tags in the output. What I have so far is a template that splits the description, but the HTML tags are lost because normalize-space and substring-before reduce the desc node to its string value. My current template is given below:

<xsl:template name="output-tokens">
  <xsl:param name="sourceText" />

  <!-- Normalize whitespace and append a trailing space -->
  <xsl:variable name="newlist" select="concat(normalize-space($sourceText), ' ')" />
  <!-- Check whether the string contains a . at all -->
  <xsl:choose>
    <xsl:when test ="contains($newlist, '.')">
      <!-- Find the first . in the string -->
      <xsl:variable name="first" select="substring-before($newlist, '.')" />

      <!-- Get the remaining text -->
      <xsl:variable name="remaining" select="substring-after($newlist, '.')" />
      <!-- Check if our string is not in fact a . or an empty string -->
      <xsl:if test="normalize-space($first)!='.' and normalize-space($first)!=''">
        <p><xsl:value-of select="normalize-space($first)" />.</p>
      </xsl:if>
      <!-- Recursively apply the template for the remaining text -->
      <xsl:if test="$remaining">
        <xsl:call-template name="output-tokens">
          <xsl:with-param name="sourceText" select="$remaining" />
        </xsl:call-template>
      </xsl:if>
    </xsl:when>
    <!--If no . was found -->
    <xsl:otherwise>
      <p>
        <!-- If the string does not contain a . then display the text but avoid
           displaying empty strings
         -->
        <xsl:if test="normalize-space($sourceText)!=''">
          <xsl:value-of select="normalize-space($sourceText)" />.
        </xsl:if>
      </p>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

and I am using it in the following manner:

<xsl:template match="config">
  <xsl:call-template name="output-tokens">
       <xsl:with-param name="sourceText" select="desc" />
  </xsl:call-template>
</xsl:template>

The expected output is:

<p>A <b>first</b> sentence here.</p>
<p>The second sentence with some link <a href="myurl">The link</a>.</p>
<p>The <u>third</u> one.</p>

4 answers
淡お忘
Reply #2 · 2019-05-07 19:55
地球回转人心会变
Reply #3 · 2019-05-07 20:04

A good question, and not an easy one to solve. Especially, of course, if you're using XSLT 1.0 (you really need to tell us if that's the case).

I've seen two approaches to the problem. Both involve breaking it into smaller problems.

The first approach is to convert the markup into text (for example replace <b>first</b> by [b]first[/b]), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.

The second approach (which I personally prefer) is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.

Explosion°爆炸
Reply #4 · 2019-05-07 20:14

Here is one way to implement the second approach suggested by Michael Kay, using XSLT 2.0.

This stylesheet demonstrates a two-pass transformation where the first pass introduces <stop/> markers after each sentence and the second pass encloses all groups ending with a <stop/> in a paragraph.

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- two-pass processing -->
  <xsl:template match="/">
    <xsl:variable name="intermediate">
      <xsl:apply-templates mode="phase-1"/>
    </xsl:variable>
    <xsl:apply-templates select="$intermediate" mode="phase-2"/>
  </xsl:template>

  <!-- identity transform -->
  <xsl:template match="@*|node()" mode="#all" priority="-1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- phase 1 -->

  <!-- insert <stop/> "milestone markup" after each sentence -->
  <xsl:template match="text()" mode="phase-1">
    <xsl:analyze-string select="." regex="\.\s+">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(0)"/>
        <stop/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- phase 2 -->

  <!-- turn each <stop/>-terminated group into a paragraph -->
  <xsl:template match="*[stop]" mode="phase-2">
    <xsl:copy>
      <xsl:for-each-group select="node()" group-ending-with="stop">
        <p>
          <xsl:apply-templates select="current-group()" mode="#current"/>
        </p>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- remove the <stop/> markers -->
  <xsl:template match="stop" mode="phase-2"/>

</xsl:stylesheet>
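
For the sample input from the question, the intermediate tree built by phase 1 should look roughly like the following. This is worked out by hand from the regex \.\s+ and is not taken from a recorded run, so treat it as an illustration only:

```xml
<config>
  <desc>A <b>first</b> sentence here. <stop/>The second sentence with some link <a href="myurl">The link</a>. <stop/>The <u>third</u> one.</desc>
</config>
```

The last sentence gets no <stop/> because its period is not followed by whitespace; xsl:for-each-group still emits it as the final group, so it is wrapped in a paragraph as well.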
家丑人穷心不美
Reply #5 · 2019-05-07 20:15

This is my humble solution, based on the second suggestion in Michael Kay's answer.

Unlike Jukka's answer (which is very elegant indeed), I'm not using xsl:analyze-string, as the XPath 1.0 functions contains, substring-before and substring-after are enough to accomplish the split. I've also started the match pattern from the config element.

Here's the transform:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <!-- two pass processing -->
    <xsl:template match="config">
        <xsl:variable name="pass1">
            <xsl:apply-templates select="node()"/>
        </xsl:variable>
        <xsl:apply-templates mode="pass2" select="$pass1/*"/>
    </xsl:template>

    <!-- 1. Copy everything as is (identity) -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- 1. Replace "text. text" with "text<dot/>text"
         (only the first ". " in each text node is handled) -->
    <xsl:template match="text()[contains(.,'. ')]">
        <xsl:value-of select="substring-before(.,'. ')"/>
        <dot/>
        <xsl:value-of select="substring-after(.,'. ')"/>
    </xsl:template>

    <!-- 2. Group by examining in population order ending with dot -->
    <xsl:template match="desc" mode="pass2">
        <xsl:for-each-group select="node()" 
            group-ending-with="dot">
            <p><xsl:apply-templates select="current-group()" mode="pass2"/></p>
        </xsl:for-each-group>
    </xsl:template>

    <!-- 2. Identity -->
    <xsl:template match="node()|@*" mode="pass2">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*" mode="pass2"/>
        </xsl:copy>
    </xsl:template>

    <!-- 2. Replace each dot marker element with a literal period -->
    <xsl:template match="dot" mode="pass2">
        <xsl:text>.</xsl:text>
    </xsl:template>

</xsl:stylesheet>

Applied to the input shown in your question, it produces:

<p>A <b>first</b> sentence here.</p>
<p>The second sentence with some link <a href="myurl">The link</a>.</p>
<p>The <u>third</u> one.</p>
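
A note on the text() template above: it inserts a <dot/> only at the first ". " in each text node. That is enough for the sample input, but it would miss a second sentence break inside the same text node. A minimal sketch of a recursive variant, still using only contains, substring-before and substring-after, is below. The template name split-dots and its text parameter are made up for illustration, and the fragment is meant to replace the text() template inside the stylesheet above rather than to run on its own:

```xml
<!-- Hypothetical replacement for the text()[contains(.,'. ')] template -->
<xsl:template match="text()[contains(., '. ')]">
    <xsl:call-template name="split-dots">
        <xsl:with-param name="text" select="."/>
    </xsl:call-template>
</xsl:template>

<!-- split-dots (hypothetical name): emit a <dot/> at every ". " -->
<xsl:template name="split-dots">
    <xsl:param name="text"/>
    <xsl:choose>
        <xsl:when test="contains($text, '. ')">
            <!-- Text before the first ". ", then the marker -->
            <xsl:value-of select="substring-before($text, '. ')"/>
            <dot/>
            <!-- Recurse on whatever follows that ". " -->
            <xsl:call-template name="split-dots">
                <xsl:with-param name="text" select="substring-after($text, '. ')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <!-- No further sentence break: copy the rest unchanged -->
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
```

With this variant, a text node such as "One. Two. Three" would be split at both breaks, so pass 2 groups it into three paragraphs instead of two.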