Normalize space issue with html tags

Posted 2019-07-30 04:28

Question:

Here's one for you XSLT gurus :-)

I have to deal with XML output from a Java program I cannot control.

In the documents this app produces, the HTML tags remain as literal markup:

<u><i><b><em>  

etc, instead of

&lt;u&gt;&lt;i&gt;&lt;b&gt;&lt;em&gt; and so on.

That's not a massive problem, since I use XSLT to fix it. However, using normalize-space to remove excess whitespace also removes the spaces before and after these HTML tags.

Example

<Locator Precode="7">
<Text LanguageId="7">The next word is <b>bold</b> and is correctly spaced 
around the html tag,
but the sentence has extra whitespace and 
line breaks</Text>
</Locator>

This is the relevant part of the XSLT script we use to remove extra whitespace:

<xsl:template match="text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>

In the resulting output, the XSLT has correctly removed the extra whitespace and the line breaks, but it has also removed the space before the tag, giving this:

The next word isboldand is correctly spaced around the html tag, but the sentence has extra whitespace and line breaks.

The spacing before and after the word "bold" has been stripped as well.

Does anyone have any ideas how to prevent this from happening? I'm pretty well at my wits' end, so any help will be greatly appreciated!

:-)

Hi again,

Yes, of course, here's the full stylesheet. We have to deal with the HTML tags and the spacing in one pass:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no" encoding="UTF-8"/>
<xsl:strip-space elements="*" />  


<xsl:template match="@*|node()">
 <xsl:copy> 
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>


<xsl:template match="Text//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
</xsl:template>


<xsl:template match="Instruction//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>

<xsl:template match="Title//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>


</xsl:stylesheet>

Answer 1:

One XSLT 1.0 solution is an XPath expression that replaces each sequence of several whitespace characters with a single one. The idea is not my own; it is taken from an answer by Dimitre Novatchev.

The advantage over using the built-in normalize-space() function on its own is that a single leading or trailing space (in your case, before and after the b element) is kept.
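The expression relies on XPath 1.0 converting a boolean to a number (false to 0, true to 1) when it appears in arithmetic. The standalone stylesheet below is only a sketch to illustrate that step: the literal strings ' and' and 'bold' are made-up stand-ins for text nodes, and the square brackets are there just to make the result visible.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Runs against any input document; only the root node is matched. -->
  <xsl:template match="/">
    <!-- First character IS a space: not(...) is false, i.e. 0,
         so this evaluates substring(' ', 1), which yields the space. -->
    <xsl:value-of select="concat('[', substring(' ', 1 + not(substring(' and', 1, 1) = ' ')), ']')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- First character is NOT a space: not(...) is true, i.e. 1,
         so this evaluates substring(' ', 2), which yields the empty string. -->
    <xsl:value-of select="concat('[', substring(' ', 1 + not(substring('bold', 1, 1) = ' ')), ']')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This prints `[ ]` on the first line and `[]` on the second. The same test applied to the last character, obtained with substring(., string-length(.)), decides whether a trailing space is put back.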

EDIT: In response to your edited question, below is that XPath expression incorporated into your stylesheet. Also:

  • Explicitly saying omit-xml-declaration="no" is redundant. It is the default for the xml output method.
  • Several of your templates have the same content. I merged them into a single template using |.

Stylesheet

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:strip-space elements="*" />  


<xsl:template match="@*|node()">
 <xsl:copy> 
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>


<xsl:template match="Text//*|Instruction//*|Title//*">
  <xsl:value-of select="concat('&lt;',name(),'&gt;')" />
  <xsl:apply-templates />
  <xsl:value-of select="concat('&lt;/',name(),'&gt;')" />
</xsl:template>

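<!-- Collapse runs of whitespace, but keep one leading and one trailing
     space if the text node starts or ends with a space character. -->
<xsl:template match="text()">
  <xsl:value-of select=
  "concat(substring(' ', 1 + not(substring(.,1,1)=' ')),
          normalize-space(),
          substring(' ', 1 + not(substring(., string-length(.)) = ' '))
          )
  "/>
</xsl:template>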

</xsl:stylesheet>

XML Output

<?xml version="1.0" encoding="UTF-8"?>
<Locator Precode="7">
   <Text LanguageId="7">The next word is &lt;b&gt;bold&lt;/b&gt; and is correctly spaced around the html tag, but the sentence has extra whitespace and line breaks</Text>
</Locator>
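One limitation worth noting: the expression above only tests for a literal space character, so a text node that ends with a line break or tab directly before a tag would still lose its separator. The variant below is a sketch, not part of the original answer, that treats any whitespace character as a boundary by checking whether normalize-space() of the first or last character is empty. The variable names and the explicit priority attribute are additions made here; the priority is an assumption to avoid relying on template order, since text() and node() otherwise have the same default priority.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything not matched below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Write inline markup out as escaped tags, as in the original. -->
  <xsl:template match="Text//*|Instruction//*|Title//*">
    <xsl:value-of select="concat('&lt;',name(),'&gt;')"/>
    <xsl:apply-templates/>
    <xsl:value-of select="concat('&lt;/',name(),'&gt;')"/>
  </xsl:template>

  <!-- Collapse whitespace, but keep one space at either edge when the
       node starts or ends with ANY whitespace character.
       normalize-space() of a single whitespace character is the empty
       string, which boolean() turns into false (0). -->
  <xsl:template match="text()" priority="1">
    <xsl:variable name="first" select="substring(., 1, 1)"/>
    <xsl:variable name="last" select="substring(., string-length(.))"/>
    <xsl:value-of select="concat(
        substring(' ', 1 + boolean(normalize-space($first))),
        normalize-space(),
        substring(' ', 1 + boolean(normalize-space($last))))"/>
  </xsl:template>
</xsl:stylesheet>
```

Because xsl:strip-space is still in effect, a text node that consists only of whitespace (for example a lone space between two adjacent inline elements) is removed before this template ever sees it, so that case would need separate handling.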