How do I strip accents from characters in XSL?

Posted 2019-05-06 07:08

Question:

I keep looking, but can't find an XSL function that is the equivalent of "normalize-space", for characters. That is, my content has accented UNICODE characters, which is great, but from that content, I'm creating a filename, where I don't want those accents.

So, is there something that I'm overlooking, or not googling properly, to easily process characters?

In the XML data:

<filename>gri_gonéwiththèw00mitc</filename>

In XSLT stylesheet:

<xsl:variable name="file">
    <xsl:value-of select="filename"/>
</xsl:variable>

<xsl:value-of select="$file"/>

results in "gri_gonéwiththèw00mitc"

where

<xsl:value-of select='replace( normalize-unicode( "$file", "NFKD" ), "[^\\p{ASCII}]", "" )'/>

results in nothing.

What I'm aiming for is gri_gonewiththew00mitc (no accents)

Am I using the syntax wrong?

Answer 1:

In XSLT/XPath 1.0, if you want to replace those accented characters with their unaccented counterparts, you can use the translate() function.

That assumes, however, that each of your accented Unicode characters is a single precomposed code point rather than a base letter followed by a combining mark. If they are stored as combining sequences, you would need the XPath 2.0 normalize-unicode() function.

And if the real goal is to have a valid URI, you should use encode-for-uri(), which is also an XPath 2.0 function.

Update: Examples

translate('gri_gonéwiththèw00mitc','áàâäéèêëíìîïóòôöúùûü','aaaaeeeeiiiioooouuuu')

Result: gri_gonewiththew00mitc
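To show how the translate() call might sit inside a real XSLT 1.0 stylesheet, here is a minimal sketch. It assumes the document element of the input is <filename>, as in the question's sample; the variable names accented and unaccented are made up for illustration. Note that the list covers only the lower-case letters shown above, so anything else (upper-case accented letters, ç, ñ, and so on) would pass through unchanged.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical variable names. The two strings must line up
       position by position: character N of "accented" is replaced
       by character N of "unaccented". -->
  <xsl:variable name="accented"   select="'áàâäéèêëíìîïóòôöúùûü'"/>
  <xsl:variable name="unaccented" select="'aaaaeeeeiiiioooouuuu'"/>

  <!-- Assumes <filename> is the document element of the input. -->
  <xsl:template match="/">
    <xsl:value-of select="translate(filename, $accented, $unaccented)"/>
  </xsl:template>
</xsl:stylesheet>
```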

encode-for-uri('gri_gonéwiththèw00mitc')

Result: gri_gon%C3%A9withth%C3%A8w00mitc

Corrected expression, as suggested by @biziclop:

replace(normalize-unicode('gri_gonéwiththèw00mitc','NFKD'),'\P{ASCII}','')

Result: gri_gonewiththew00mitc

Note: In XPath 2.0, the correct character class negation is with a capital \P.
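As for why the expression in the question returns nothing: most likely because "$file" in quotes is a string literal rather than a variable reference, and because XPath string literals have no backslash escaping, so "[^\\p{ASCII}]" is read as a negated set of the literal characters \, p, {, A, S, C, I and }, which matches (and deletes) every character of the text "$file". Below is a minimal XSLT 2.0 sketch of the corrected call, assuming <filename> is the document element of the input. It uses the block name IsBasicLatin, which XML Schema regular expressions define; whether \P{ASCII} is accepted may depend on the processor.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumes <filename> is the document element of the input. -->
  <xsl:template match="/">
    <xsl:variable name="file" select="string(filename)"/>
    <!-- $file without quotes is a variable reference.
         NFKD splits é into e + U+0301 (combining acute accent);
         \P{IsBasicLatin} then matches every character outside
         U+0000 to U+007F, so the combining marks are removed. -->
    <xsl:value-of
        select="replace(normalize-unicode($file, 'NFKD'), '\P{IsBasicLatin}', '')"/>
  </xsl:template>
</xsl:stylesheet>
```

With the sample input from the question, this should output gri_gonewiththew00mitc.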



Answer 2:

So, contrary to my comment, you could try this:

replace( normalize-unicode( "öt hűtőházból kértünk színhúst", "NFKD" ), "[^\p{ASCII}]", "" )

Be warned, though, that any characters that can't be decomposed and aren't basic ASCII (Norwegian ø or Icelandic Þ, for example) will be deleted from the string entirely. That's probably acceptable for your requirements.
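If losing those letters is not acceptable, one option is to map them by hand before stripping. The following is only a sketch: the function name my:to-ascii, its namespace URI, and the chosen substitutions (ø to o, Þ to Th) are assumptions for illustration, and it uses the IsBasicLatin block name in place of ASCII.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:strip-accents"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper function, not part of any standard library. -->
  <xsl:function name="my:to-ascii" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <!-- One-to-one substitutions for letters that have no decomposition. -->
    <xsl:variable name="mapped" select="translate($text, 'øØ', 'oO')"/>
    <!-- One-to-many substitutions need replace(), since translate()
         can only swap one character for one character. -->
    <xsl:variable name="expanded"
        select="replace(replace($mapped, 'Þ', 'Th'), 'þ', 'th')"/>
    <!-- Decompose what is left, then drop everything outside Basic Latin. -->
    <xsl:sequence
        select="replace(normalize-unicode($expanded, 'NFKD'), '\P{IsBasicLatin}', '')"/>
  </xsl:function>
</xsl:stylesheet>
```

Called as my:to-ascii('Þór ø'), this should yield Thor o; the table of hand-mapped letters would need extending for whatever languages your data actually contains.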



Answer 3:

The previously suggested approaches contain the unknown character class named 'ASCII'. In my experience, XPath 2.0 recognises the block 'BasicLatin' (written with the Is prefix, as below), which should serve the same purpose as 'ASCII'.

replace(normalize-unicode('Lliç d''Am Oükl Úkřeč', 'NFKD'), '\P{IsBasicLatin}', '')