Lowercase part of a string with XPath regular expr

Posted 2019-06-05 01:12

Question:

In a node, a string might contain one or more substrings delimited by single or double quotes. For example

<node>Some text "and Some" More</node>

What I have to do is lowercase the text that is not surrounded by quotes, so the result should look as:

some text "and Some" more

I've tried two things:

  1. with replace: replace('Some text "and Some" More', '"([^"]*)"', '*') replaces the text in double quotes with *. But how can I lowercase it? This doesn't produce the desired result: replace('Some text "and Some" More', '"([^"]*)"', lower-case('$1'))
  2. with tokenize: for $t in tokenize('Some text "and Some" More', '"') return $t. Since my node will not start with ", I know the even entries will be the substrings surrounded by quotes, and the odd entries are the ones to lowercase. But I don't know how to select and lower-case only the odd entries. I tried position(), but it returns 1 on each iteration.

Thanks for looking into this. Much appreciated.

Answer 1:

Here is a single XPath 2.0 expression that handles any mixture of double-quoted and unquoted strings, in any order. It uses | as a temporary separator, so it assumes the text itself contains no | character:

  string-join(
  (for $str in tokenize(replace(., "(.*?)("".*?"")([^""]*)", "|$1|$2|$3|", "x"),"\|")
     return
      if(not(contains($str, """")))
        then lower-case($str)
        else $str
  ),
  "")

For a comprehensive test, I evaluate the above expression on the following XML document:

<node>Some "Text""and Some" More "Text" XXX "Even More"</node>

The wanted, correct result is produced:

some "Text""and Some" more "Text" xxx "Even More"

XSLT 2.0 verification:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:sequence select=
  'string-join(
  (for $str in tokenize(replace(., "(.*?)("".*?"")([^""]*)", "|$1|$2|$3|", "x"),"\|")
     return
      if(not(contains($str, """")))
        then lower-case($str)
        else $str
  ),
  "")
  '/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the above XML document, the XPath expression is evaluated and its result is copied to the output:

some "Text""and Some" more "Text" xxx "Even More"

Finally, an XSLT 2.0 solution -- much easier to write and understand:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/*">
  <xsl:analyze-string select="." regex='".*?"'>
   <xsl:non-matching-substring>
     <xsl:sequence select="lower-case(.)"/>
   </xsl:non-matching-substring>
   <xsl:matching-substring><xsl:sequence select="."/></xsl:matching-substring>
  </xsl:analyze-string>
 </xsl:template>
</xsl:stylesheet>
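
As a cross-check outside an XSLT processor, the same split into matching and non-matching substrings can be sketched in Python. This is only a rough analogue of xsl:analyze-string, not part of the solution above: the function name lower_outside_quotes is made up for illustration, and Python's re module stands in for the XPath regex engine.

```python
import re

# Same pattern as the stylesheet: one double-quoted run, matched non-greedily.
QUOTED = re.compile(r'".*?"')


def lower_outside_quotes(text: str) -> str:
    """Lowercase everything except double-quoted substrings (illustrative helper)."""
    parts = []
    pos = 0
    for m in QUOTED.finditer(text):
        parts.append(text[pos:m.start()].lower())  # non-matching substring
        parts.append(m.group(0))                   # matching substring, kept as-is
        pos = m.end()
    parts.append(text[pos:].lower())               # trailing non-matching text
    return "".join(parts)


print(lower_outside_quotes('Some "Text""and Some" More "Text" XXX "Even More"'))
# some "Text""and Some" more "Text" xxx "Even More"
```

In this sketch, text after an unbalanced final quote counts as unquoted and is lowercased.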


Answer 2:

Whew.

In case you'd like it the hard way:

concat(
  translate(substring-before(//node/text(), '"'), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),
  '"',
  substring(substring-after(//node/text(), '"'), 1, string-length(substring-after(//node/text(), '"')) - string-length(substring-after(substring-after(//node/text(), '"'), '"')) - 1),
  '"',
  translate(substring-after(substring-after(//node/text(), '"'), '"'), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'))

Just replace //node/text() with whatever XPath gets you to the text you want. I did this just for fun; it isn't the "cleanest" (HA!) solution, and it only handles the first double-quoted substring.

You could make it faster by making the target node the context node, or by giving a more direct path to it than //node.
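
The substring-before / substring-after chain above peels off the text before the first quote, the quoted part, and the remainder. A rough Python analogue, using str.partition in place of the XPath functions, may make that three-way split easier to see. The function name is hypothetical, and this is a sketch of the idea rather than a port of the expression.

```python
def lower_around_first_quoted(text: str) -> str:
    """Lowercase the text around the first double-quoted substring (illustrative helper)."""
    # Split at the first quote: text before it, the quote itself, the rest.
    before, q1, rest = text.partition('"')
    # Split the rest at the next quote: quoted content, closing quote, remainder.
    inside, q2, after = rest.partition('"')
    # Only the outer pieces are lowercased; the quoted content is untouched.
    return before.lower() + q1 + inside + q2 + after.lower()


print(lower_around_first_quoted('Some text "and Some" More'))
# some text "and Some" more
```

Everything after the second quote is lowercased wholesale, so a later quoted substring would lose its capitalization. That is the price of doing this without loops or regular expressions.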



Answer 3:

In XQuery you can use

string-join(
  for $x at $i in tokenize('Some text "and Some" More', '"') return
    if ($i mod 2 = 1) then lower-case($x)
    else $x
  , '"')

but XPath's for expression is more limited and has no at clause.

In XPath 3 you can use the simple map operator ! (which is kind of like a for, except that it sets . and position()). The if expression has to be parenthesized, because the right-hand side of ! must be a path expression:

string-join(
  tokenize('Some text "and Some" More', '"') !
    (if (position() mod 2 = 1) then lower-case(.)
     else .)
  , '"')

And finally in XPath 2 you can iterate over the index and get the substring for each index:

string-join(
  for $i in 1 to count(tokenize('Some text "and Some" More', '"')) return
    if ($i mod 2 = 1) then lower-case(tokenize('Some text "and Some" More', '"')[$i])
    else tokenize('Some text "and Some" More', '"')[$i]
  , '"')
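
All three variants rely on the same observation: after splitting on the quote character, tokens at odd positions lie outside quotes. For readers without an XPath or XQuery processor at hand, here is a minimal Python sketch of that parity rule. The function name lower_odd_tokens and the quote parameter are illustrative assumptions, and str.split stands in for tokenize.

```python
def lower_odd_tokens(text: str, quote: str = '"') -> str:
    """Lowercase tokens at odd 1-based positions after splitting on the quote character."""
    tokens = text.split(quote)
    return quote.join(
        # Odd positions (1, 3, 5, ...) are outside quotes; even ones are inside.
        tok.lower() if i % 2 == 1 else tok
        for i, tok in enumerate(tokens, start=1)
    )


print(lower_odd_tokens('Some text "and Some" More'))
# some text "and Some" more
```

Because str.split yields an empty first token when the text begins with the quote character, the parity rule also holds for text that starts with a quoted substring. The sketch assumes balanced quotes and a single kind of quote character; a string mixing ' and " delimiters would need a different approach.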