XSLT works too slow

Posted 2019-05-23 11:12

Question:

I have around 100 XML files which I want to transform into another file with a better structure. This example converts them to CSV, but I also have a variant that transforms them into better-structured XML. The output format is not that relevant for me. I see there are tons of questions like this, but I find the examples hard to adapt, because my problem is not that the stylesheet doesn't work but that it is too slow.

My data files are between 4 and 12 MB each. The XSLT I have provided here works well with small files. For example, when I cut a file down to a 250 KB piece, the stylesheet processes it correctly (though this already takes around 30 seconds). When I try it on an actual, larger data file, it never seems to finish the job, not even with a single file. I am using Oxygen XML Editor with Saxon-HE 9.5.1.2 for the transformation.

One remark: the solution can still be slowish. I can leave my computer running overnight if needed. This concerns one malformed dataset, and I don't need to repeat the transformation often at all.

So my question is:

Is there something in this XSLT that makes it work particularly slowly? Would some other approach work better?

These are simplified working examples. The actual data files are structurally identical, but have more nodes which I called "words" in this example. The attribute type specifies which nodes I'm after. It is linguistic dialect data with dialectal words and their normalized versions.

This is the XML.

<?xml version="1.0" encoding="UTF-8"?>
<xml>
<order>
    <slot id="ts1" value="1957"/>
    <slot id="ts2" value="1957"/>
    <slot id="ts3" value="2389"/>
    <slot id="ts4" value="2389"/>
    <slot id="ts5" value="2389"/>
    <slot id="ts6" value="2389"/>
    <slot id="ts7" value="3252"/>
    <slot id="ts8" value="3252"/>
    <slot id="ts9" value="3252"/>
    <slot id="ts10" value="3360"/>
</order>
<words type="original word">
    <annotation>
        <data id_1="ts1" id_2="ts3">
            <text>dialectal_word_1</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts4" id_2="ts7">
            <text>dialectal_word_2</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts8" id_2="ts10">
            <text>,</text>
        </data>
    </annotation>
</words>
<words type="normalized word">
    <annotation>
        <data id_1="ts2" id_2="ts5">
            <text>normalized_word_1</text>
        </data>
    </annotation>
    <annotation>
        <data id_1="ts6" id_2="ts9">
            <text>normalized_word_2</text>
        </data>
    </annotation>
</words>
</xml>

This is the XSLT. It attempts to pair each original word with the normalized word whose start and end slots have the same values in the order element at the top of the XML.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text" encoding="UTF-8" indent="yes"/>
<xsl:template match="/xml">
    <xsl:text>original&#x9;normalized
</xsl:text>
        <xsl:for-each select="words[@type='original word']/annotation/data">
            <xsl:sort select="substring-after(@id_1, 'ts')" data-type="number"/>
            <xsl:variable name="origStartTimeId" select="@id_1"/>
            <xsl:variable name="origEndTimeId" select="@id_2"/>
            <xsl:variable name="origStartTime_VALUE" select="/xml/order/slot[@id=$origStartTimeId]/@value"/>
            <xsl:variable name="origEndTime_VALUE" select="/xml/order/slot[@id=$origEndTimeId]/@value"/>
                    <xsl:value-of select="text"/>
            <xsl:text>&#x9;</xsl:text>    
                <xsl:for-each select="/xml/words[@type='normalized word']/annotation/data">
                    <xsl:variable name="normStartTime" select="@id_1"/>
                    <xsl:variable name="normEndTime" select="@id_2"/>
                    <xsl:variable name="normStartTime_VALUE" select="/xml/order/slot[@id=$normStartTime]/@value"/>
                    <xsl:variable name="normEndTime_VALUE" select="/xml/order/slot[@id=$normEndTime]/@value"/>
                    <xsl:if test="($normStartTime_VALUE = $origStartTime_VALUE) and ($normEndTime_VALUE = $origEndTime_VALUE)">
                            <xsl:value-of select="text"/>    
                    </xsl:if>
                </xsl:for-each>
            <xsl:text>
</xsl:text>
        </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

What it outputs is simply this:

original    normalized
dialectal_word_1    normalized_word_1
dialectal_word_2    normalized_word_2
,   

And that would be fine for me.

Thanks!

Answer 1:

The double nested for-each in your current stylesheet is inefficient and will get worse as the size of the file grows - you've got (number of original words)*(number of normalized words) iterations, essentially quadratic complexity (assuming there's roughly the same number of original and normalized words in the file). You can do much better if you use keys, which work by building a lookup table that you can use to find nodes very quickly (typically in constant rather than linear time).

<!-- I've said version="2.0" to match your stylesheet in the question, but this
     code is actually valid XSLT 1.0 as it doesn't use any 2.0-specific features
     or functions -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" encoding="UTF-8" indent="yes"/>

  <!-- first key to look up slot elements by their id -->
  <xsl:key name="slotById" match="slot" use="@id" />
  <!-- second key to look up normalized word annotations by the value of their slots -->
  <xsl:key name="annotationBySlots" match="words[@type='normalized word']/annotation"
           use="concat(key('slotById', data/@id_1)/@value, '|',
                       key('slotById', data/@id_2)/@value)" />

  <xsl:template match="/xml">
    <xsl:text>original&#x9;normalized&#xA;</xsl:text>
    <xsl:apply-templates select="words[@type = 'original word']/annotation" />
  </xsl:template>

  <xsl:template match="annotation">
    <xsl:value-of select="data/text" />
    <xsl:text>&#x9;</xsl:text>
    <xsl:value-of select="
            key('annotationBySlots',
                concat(key('slotById', data/@id_1)/@value, '|',
                       key('slotById', data/@id_2)/@value)
            )/data/text" />
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

This should run in linear time (one "iteration" per original word annotation, plus the time taken to build the lookup tables which again should be linear in the number of slots plus the number of normalized word annotations).



Answer 2:

Constructs like /xml/order/slot[@id=$origStartTimeId] call for defining a key, <xsl:key name="slot-by-id" match="xml/order/slot" use="@id"/>, and then using key('slot-by-id', $origStartTimeId) instead of /xml/order/slot[@id=$origStartTimeId]. Make the same change in all places and I am sure performance will improve.
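For illustration, here is a sketch of how the stylesheet from the question might look with only that change applied. It assumes slot ids are unique, as they are in the sample data, and the intermediate id variables have been dropped because key() can take the attributes directly. It has not been timed against the real files.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- index every slot by its id so a lookup no longer scans /xml/order -->
  <xsl:key name="slot-by-id" match="xml/order/slot" use="@id"/>

  <xsl:template match="/xml">
    <xsl:text>original&#x9;normalized&#xA;</xsl:text>
    <xsl:for-each select="words[@type='original word']/annotation/data">
      <xsl:sort select="substring-after(@id_1, 'ts')" data-type="number"/>
      <!-- time values of the original word, resolved through the key -->
      <xsl:variable name="origStartTime_VALUE" select="key('slot-by-id', @id_1)/@value"/>
      <xsl:variable name="origEndTime_VALUE" select="key('slot-by-id', @id_2)/@value"/>
      <xsl:value-of select="text"/>
      <xsl:text>&#x9;</xsl:text>
      <!-- still visits every normalized word for each original word -->
      <xsl:for-each select="/xml/words[@type='normalized word']/annotation/data">
        <xsl:if test="key('slot-by-id', @id_1)/@value = $origStartTime_VALUE
                      and key('slot-by-id', @id_2)/@value = $origEndTime_VALUE">
          <xsl:value-of select="text"/>
        </xsl:if>
      </xsl:for-each>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Note that this only removes the repeated scans of the slot list. The nested for-each remains, so the number of iterations is still (original words) * (normalized words); the second key in Answer 1 is what removes that inner loop.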