I am trying to use XSLT 2.0 (Saxon-PE 9.6) on an HTML document to wrap every contiguous run of characters from a specified non-Latin Unicode block (spaces allowed within a run) in a tag. I need to apply this to every text() node in the document. I have made some progress with two approaches, one using <xsl:analyze-string> and one using fn:replace(), but I have not been able to arrive at a complete solution.
For example, here is some text containing Hindi:
Input: <p>चाय का कप means ‘cup of tea’ in हिन्दि.</p>
Desired Output: <p><span xml:lang="hi-Deva">चाय का कप</span> means ‘cup of tea’ in <span xml:lang="hi-Deva">हिन्दि</span>.</p>
How can this process be implemented in XSLT 2.0?
Here's my attempt with <xsl:analyze-string>:
(Note: the Hindi language uses the Devanagari code block U+0900 to U+097F.)
<xsl:template match="text()">
  <xsl:variable name="textValue" select="."/>
  <xsl:analyze-string select="$textValue" regex="(\s*.*?)([ऀ-ॿ]+)((\s+[ऀ-ॿ]+)*)(\s*.*)">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span xml:lang="hi-Deva"><xsl:value-of select="regex-group(2)"/><xsl:value-of select="regex-group(3)"/></span>
      <xsl:value-of select="regex-group(5)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="$textValue"/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
On the test input, this produces:
<p><span xml:lang="hi-Deva">चाय का कप</span> means ‘cup of tea’ in हिन्दि.</p>
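This approach misses the second region of Hindi text (हिन्दि). I suspect this is because the trailing (\s*.*) group consumes the rest of the text node, so the regex matches only once per node. I need an approach that will find and tag every occurrence matched by the regex.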
My second approach used fn:replace():
<xsl:template match="text()">
  <xsl:value-of select='fn:replace(., "[ऀ-ॿ]+(\s+[ऀ-ॿ]+)*", "xxx$0xxx")'/>
</xsl:template>
On the test input this produces: <p>xxxचाय का कपxxx means ‘cup of tea’ in xxxहिन्दिxxx.</p>
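This is clearly incorrect, since the Hindi is wrapped in xxx’s, not span tags, but on the positive side, each region of Hindi is in fact discovered and processed. I cannot simply put span tags in the replacement string: literal markup inside the select attribute is not well-formed XML, and fn:replace() returns a string rather than element nodes in any case.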
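One direction that might combine the two attempts is sketched below. It reuses the simpler regex from the fn:replace() attempt inside <xsl:analyze-string>, so that each run is a separate match, and it emits the context item (.) in both branches instead of $textValue. The identity template, the priority="1" attribute, and the use of character references (&#x900; to &#x97F;) in place of the literal characters are assumptions added for illustration and are not part of the stylesheet above. The sketch also assumes the span should be in no namespace; an XHTML input would need the appropriate namespace declared on the stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed identity template: copies everything not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- priority="1" avoids an ambiguous-match conflict with node() above. -->
  <xsl:template match="text()" priority="1">
    <!-- The regex covers only the Devanagari runs (U+0900 to U+097F),
         with whitespace allowed between runs, so every run in the
         text node is found as its own match. -->
    <xsl:analyze-string select="."
        regex="[&#x900;-&#x97F;]+(\s+[&#x900;-&#x97F;]+)*">
      <xsl:matching-substring>
        <!-- Inside this branch, "." is the matched substring. -->
        <span xml:lang="hi-Deva"><xsl:value-of select="."/></span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Text between matches is passed through unchanged. -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

On the test input this should wrap both चाय का कप and हिन्दि in their own span while leaving the surrounding Latin text untouched, although I have not confirmed the behaviour on Saxon-PE 9.6 specifically.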