Wrapping words in an HTML document using XSLT

Posted 2019-07-15 13:47

I need to wrap each word in a tag (e.g. span) in an HTML document like this:

<html>
<head>
    <title>It doesnt matter</title>
</head>
<body>
         <div> Text in a div </div>
         <div>
    Text in a div
    <p>
        Text inside a p
    </p>
     </div>
</body>
</html>

The result should look something like this:

<html>
<head>
    <title>It doesnt matter</title>
</head>
<body>
         <div> <span>Text </span> <span> in </span> <span> a </span> <span> div </span> </div>
         <div>

             <span>Text </span> <span> in </span> <span> a </span> <span> div </span>                     
             <p>
               <span>Text </span> <span> inside </span> <span> a </span> <span> p </span>
             </p>
     </div>
</body>
</html>

It's important to keep the structure of the body...

Any help?

Tags: html xslt
2 answers
Explosion°爆炸
#2 · 2019-07-15 14:02

All three solutions below use the XSLT design pattern of overriding the identity rule: the identity template copies every node unchanged, which preserves the structure and contents of the XML document, and more specific templates modify only the nodes of interest.
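For readers less familiar with the identity-rule pattern, here is a rough Python sketch of the same idea. Python, the function names transform and shout_body_text, and the tiny sample document are assumptions made for illustration only; none of them is part of the XSLT solutions. Every node is copied as is unless an override supplies a replacement.

```python
from xml.dom import minidom

def transform(node, out_doc, override):
    """Copy node into out_doc unless override() supplies replacement nodes."""
    replacement = override(node, out_doc)
    if replacement is not None:
        return replacement
    if node.nodeType == node.ELEMENT_NODE:
        # "Identity rule" for elements: same name, same attributes,
        # children processed recursively.
        copy = out_doc.createElement(node.tagName)
        for name, value in node.attributes.items():
            copy.setAttribute(name, value)
        for child in node.childNodes:
            for result in transform(child, out_doc, override):
                copy.appendChild(result)
        return [copy]
    # Text, comments and so on: plain copy.
    return [out_doc.importNode(node, True)]

def shout_body_text(node, out_doc):
    """Hypothetical override: upper-case text nodes that are not in <title>."""
    if node.nodeType == node.TEXT_NODE and node.parentNode.nodeName != "title":
        return [out_doc.createTextNode(node.data.upper())]
    return None  # no override, fall back to the plain copy

source = minidom.parseString(
    '<html><head><title>kept</title></head>'
    '<body><p class="x">changed</p></body></html>')
root = transform(source.documentElement, minidom.Document(), shout_body_text)[0]
print(root.toxml())
```

The override plays the role of the more specific template: it only decides what happens to the text nodes it cares about, and everything else goes through the generic copy.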

I. XSLT 1.0 solution:

This short and simple transformation (no <xsl:choose> used anywhere):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="*[not(self::title)]/text()"
               name="split">
  <xsl:param name="pText" select=
       "concat(normalize-space(.), ' ')"/>

  <xsl:if test="string-length(normalize-space($pText)) >0">
   <span>
   <xsl:value-of select=
        "substring-before($pText, ' ')"/>
   </span>

   <xsl:call-template name="split">
    <xsl:with-param name="pText"
         select="substring-after($pText, ' ')"/>
   </xsl:call-template>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>
    <head>
        <title>It doesnt matter</title>
    </head>
    <body>
        <div> Text in a div </div>
        <div>
         Text in a div
            <p>
             Text inside a p
         </p>
        </div>
    </body>
</html>

produces the wanted, correct result:

<html>
   <head>
      <title>It doesnt matter</title>
   </head>
   <body>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
      </div>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
         <p>
            <span>Text</span>
            <span>inside</span>
            <span>a</span>
            <span>p</span>
         </p>
      </div>
   </body>
</html>
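As a cross-check of the recursion in the XSLT 1.0 template, here is a minimal Python sketch of the same steps. It is only an illustration: the names normalize_space, split and split_text_node are assumptions that mirror the XPath function and the named template, and the sample string is taken from the p element of the input.

```python
import re

def normalize_space(s):
    """Like XPath normalize-space(): trim, collapse runs of space/tab/CR/LF."""
    return " ".join(t for t in re.split(r"[ \t\r\n]+", s) if t)

def split(p_text):
    """Recursive word split, shaped like the named template 'split'."""
    if len(normalize_space(p_text)) == 0:   # the xsl:if guard
        return []
    # p_text always contains a space at this point, so partition() behaves
    # like substring-before / substring-after.
    head, _, tail = p_text.partition(" ")
    return [head] + split(tail)

def split_text_node(value):
    # Default value of pText: the normalized text plus one trailing space,
    # so that the last word is also followed by a delimiter.
    return split(normalize_space(value) + " ")

words = split_text_node("\n   Text inside a p\n ")
print(words)                        # ['Text', 'inside', 'a', 'p']
print(split_text_node("  \n "))     # []
```

The trailing space appended to the normalized text is what removes the need for an xsl:choose: every word, including the last one, is terminated by a delimiter, so a single guard on the remaining length is enough.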

II. XSLT 2.0 solution:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="*[not(self::title)]/text()">
  <xsl:for-each select="tokenize(., '[\s]')[.]">
   <span><xsl:sequence select="."/></span>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied to the same XML document (above), again the correct, wanted result is produced:

<html>
   <head>
      <title>It doesnt matter</title>
   </head>
   <body>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
      </div>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
         <p>
            <span>Text</span>
            <span>inside</span>
            <span>a</span>
            <span>p</span>
         </p>
      </div>
   </body>
</html>
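The tokenize(., '[\s]')[.] expression can be mimicked in Python as follows. This is a sketch for illustration only; tokenize_words is an assumed name, and the character class is spelled out because Python's \s matches more characters than the XPath regex \s does.

```python
import re

def tokenize_words(value):
    # Splitting on a single whitespace character leaves empty strings between
    # adjacent separators, much like tokenize(); the filter plays the role of
    # the [.] predicate, which drops the empty tokens.
    return [t for t in re.split(r"[ \t\r\n]", value) if t]

spans = ["<span>%s</span>" % w for w in tokenize_words("\n  Text in a div\n")]
print(spans)
# ['<span>Text</span>', '<span>in</span>', '<span>a</span>', '<span>div</span>']
```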

III. Solution using FXSL:

Using the str-split-to-words template/function of FXSL one can easily implement much more complicated tokenization -- in any version of XSLT:

Let's have a more complicated XML document and tokenization rules:

<html>
    <head>
        <title>It doesnt matter</title>
    </head>
    <body>
        <div> Text: in a div </div>
        <div>
         Text; in; a. div
            <p>
             Text- inside [a] [p]
         </p>
        </div>
    </body>
</html>

Here there is more than one delimiter that indicates the start or end of a word. In this particular example the delimiters can be: " ", ";", ".", ":", "-", "[", "]".

The following transformation uses FXSL for this more complicated tokenization:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common"
 exclude-result-prefixes="ext">

   <xsl:import href="strSplit-to-Words.xsl"/>

   <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
   <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*[not(self::title)]/text()">
      <xsl:variable name="vwordNodes">
        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="normalize-space(.)"/>
          <xsl:with-param name="pDelimiters" 
                          select="' ;.:-[]'"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:apply-templates select="ext:node-set($vwordNodes)/*"/>
    </xsl:template>

    <xsl:template match="word[string-length(normalize-space(.)) > 0]">
      <span>
        <xsl:value-of select="."/>
      </span>
    </xsl:template>
</xsl:stylesheet>

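and produces the wanted result, apart from one stray empty <word/> element (the last template matches only non-empty words, so an empty one falls through to the identity rule and is copied as is; an extra empty template matching word[not(normalize-space())] should suppress it):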

<html>
   <head>
      <title>It doesnt matter</title>
   </head>
   <body>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
      </div>
      <div>
         <span>Text</span>
         <span>in</span>
         <span>a</span>
         <span>div</span>
         <p>
            <span>Text</span>
            <span>inside</span>
            <span>a</span>
            <span>p</span>
            <word/>
         </p>
      </div>
   </body>
</html>
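To make the multi-delimiter rule concrete, here is a small Python sketch of splitting on any character from a delimiter set. It is not FXSL's implementation of str-split-to-words, only an illustration of the idea, and split_on_delimiters is an assumed name. Empty words are never emitted, so nothing like the stray <word/> in the output above can appear.

```python
def split_on_delimiters(text, delimiters):
    """Split text at any character in delimiters, skipping empty words."""
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:                 # close the word in progress, if any
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                         # word that runs to the end of the text
        words.append("".join(current))
    return words

delims = " ;.:-[]"
print(split_on_delimiters("Text- inside [a] [p]", delims))
# ['Text', 'inside', 'a', 'p']
print(split_on_delimiters("Text; in; a. div", delims))
# ['Text', 'in', 'a', 'div']
```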
趁早两清
#3 · 2019-07-15 14:16

You could achieve this by extending the identity transform with a recursive template that checks whether a piece of text contains a space and, if it does, puts a span tag around the first word. It then calls itself recursively for the remaining portion of the text.

Here it is in action:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
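   <xsl:output method="html" indent="yes"/>
   <!-- Drop whitespace-only text nodes so they are not wrapped in empty spans -->
   <xsl:strip-space elements="*"/>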

   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Don't split the words in the title -->
   <xsl:template match="title">
      <xsl:copy-of select="." />
   </xsl:template>

   <!-- Matches a text node. Named so that it can call itself recursively -->
   <xsl:template match="text()" name="wrapper">
      <xsl:param name="text" select="." />
      <xsl:variable name="new" select="normalize-space($text)" />
      <xsl:choose>
         <xsl:when test="contains($new, ' ')">
            <span><xsl:value-of select="concat(substring-before($new, ' '), ' ')" /></span>
            <xsl:call-template name="wrapper">
               <xsl:with-param name="text" select="substring-after($new, ' ')" />
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <span><xsl:value-of select="$new" /></span>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

When applied to your sample HTML, the output is as follows:

<html>
   <head>
      <title>It doesnt matter</title>
   </head>
   <body>
      <div>
         <span>Text </span>
         <span>in </span>
         <span>a </span>
         <span>div</span>
      </div>
      <div>
         <span>Text </span>
         <span>in </span>
         <span>a </span>
         <span>div</span>
         <p>
            <span>Text </span>
            <span>inside </span>
            <span>a </span>
            <span>p</span>
         </p>
      </div>
   </body>
</html>

I wasn't 100% sure how important the spaces within the span elements are to you, so this version keeps a trailing space in every span except the last one.
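For comparison, the when/otherwise recursion of the wrapper template can be sketched in Python like this. It is an illustration only; normalize_space and wrapper are assumed names that mirror the XPath function and the named template.

```python
import re

def normalize_space(s):
    """Like XPath normalize-space(): trim, collapse runs of space/tab/CR/LF."""
    return " ".join(t for t in re.split(r"[ \t\r\n]+", s) if t)

def wrapper(text):
    new = normalize_space(text)
    if " " in new:                                   # xsl:when
        head, _, rest = new.partition(" ")
        # The first word keeps one trailing space inside its span.
        return ["<span>" + head + " </span>"] + wrapper(rest)
    return ["<span>" + new + "</span>"]              # xsl:otherwise

print(wrapper("\n   Text inside a p\n "))
# ['<span>Text </span>', '<span>inside </span>', '<span>a </span>', '<span>p</span>']
print(wrapper("   "))
# ['<span></span>']
```

The second call shows one thing to watch for: a whitespace-only string goes straight to the otherwise branch and yields an empty span, so it matters whether whitespace-only text nodes ever reach the template.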
