I have around 100 XML files that I want to transform into another format with a better structure. This example converts them to CSV, but I also have a variant that produces better-structured XML; the output format is not that important to me. I see there are tons of similar questions, but I find the examples hard to adapt, because my problem is not that the stylesheet doesn't work but that it is too slow.
My data files are between 4 and 12 MB each. The XSLT I have provided here works well with small files: when I cut a file down to a 250 KB piece, the stylesheet processes it correctly, though even that takes around 30 seconds. When I run it on an actual full-size data file, it never seems to finish, not even for a single file. I am working in Oxygen XML Editor and have been using Saxon-HE 9.5.1.2 for the transformation.
One remark: the solution can still be fairly slow. I can leave my computer running overnight if needed. This concerns one malformed dataset, and I don't need to repeat the transformation often.
So my question is:
Is there something in this XSLT that makes it run particularly slowly? Would some other approach work better?
These are simplified working examples. The actual data files are structurally identical but contain more of the nodes I call "words" in this example. The type attribute specifies which nodes I'm after. The data is linguistic dialect data, with dialectal words and their normalized versions.
This is the XML.
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<order>
<slot id="ts1" value="1957"/>
<slot id="ts2" value="1957"/>
<slot id="ts3" value="2389"/>
<slot id="ts4" value="2389"/>
<slot id="ts5" value="2389"/>
<slot id="ts6" value="2389"/>
<slot id="ts7" value="3252"/>
<slot id="ts8" value="3252"/>
<slot id="ts9" value="3252"/>
<slot id="ts10" value="3360"/>
</order>
<words type="original word">
<annotation>
<data id_1="ts1" id_2="ts3">
<text>dialectal_word_1</text>
</data>
</annotation>
<annotation>
<data id_1="ts4" id_2="ts7">
<text>dialectal_word_2</text>
</data>
</annotation>
<annotation>
<data id_1="ts8" id_2="ts10">
<text>,</text>
</data>
</annotation>
</words>
<words type="normalized word">
<annotation>
<data id_1="ts2" id_2="ts5">
<text>normalized_word_1</text>
</data>
</annotation>
<annotation>
<data id_1="ts6" id_2="ts9">
<text>normalized_word_2</text>
</data>
</annotation>
</words>
</xml>
This is the XSLT. It attempts to pair each original word with the normalized word whose start and end slots have the same values in the order element.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="text" encoding="UTF-8" indent="yes"/>
<xsl:template match="/xml">
<xsl:text>original	normalized
</xsl:text>
<xsl:for-each select="words[@type='original word']/annotation/data">
<xsl:sort select="substring-after(@id_1, 'ts')" data-type="number"/>
<xsl:variable name="origStartTimeId" select="@id_1"/>
<xsl:variable name="origEndTimeId" select="@id_2"/>
<xsl:variable name="origStartTime_VALUE" select="/xml/order/slot[@id=$origStartTimeId]/@value"/>
<xsl:variable name="origEndTime_VALUE" select="/xml/order/slot[@id=$origEndTimeId]/@value"/>
<xsl:value-of select="text"/>
<xsl:text>	</xsl:text>
<xsl:for-each select="/xml/words[@type='normalized word']/annotation/data">
<xsl:variable name="normStartTime" select="@id_1"/>
<xsl:variable name="normEndTime" select="@id_2"/>
<xsl:variable name="normStartTime_VALUE" select="/xml/order/slot[@id=$normStartTime]/@value"/>
<xsl:variable name="normEndTime_VALUE" select="/xml/order/slot[@id=$normEndTime]/@value"/>
<xsl:if test="($normStartTime_VALUE = $origStartTime_VALUE) and ($normEndTime_VALUE = $origEndTime_VALUE)">
<xsl:value-of select="text"/>
</xsl:if>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
What it outputs is simply this:
original normalized
dialectal_word_1 normalized_word_1
dialectal_word_2 normalized_word_2
,
And that would be fine for me.
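For comparison, a key-based variant might avoid the nested lookups. In the stylesheet above, every original word triggers a scan over all normalized words, and every comparison in that scan searches the slot list again, so the work grows roughly with the product of those counts. The following is only a minimal sketch of the alternative, not something measured on the full files. It assumes that slot ids are unique and that the character | never occurs in a value attribute; the key names slotById and normByTimes are made up for illustration. It also relies on XSLT 2.0 allowing one key definition to call key() on another.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Index every slot by its id (assumption: ids are unique). -->
  <xsl:key name="slotById" match="order/slot" use="@id"/>

  <!-- Index every normalized word by "startValue|endValue"
       (assumption: '|' never appears in a value attribute). -->
  <xsl:key name="normByTimes"
           match="words[@type='normalized word']/annotation/data"
           use="concat(key('slotById', @id_1)/@value, '|',
                       key('slotById', @id_2)/@value)"/>

  <xsl:template match="/xml">
    <xsl:text>original&#9;normalized&#10;</xsl:text>
    <xsl:for-each select="words[@type='original word']/annotation/data">
      <xsl:sort select="substring-after(@id_1, 'ts')" data-type="number"/>
      <!-- Build the same composite value for the original word. -->
      <xsl:variable name="times"
                    select="concat(key('slotById', @id_1)/@value, '|',
                                   key('slotById', @id_2)/@value)"/>
      <xsl:value-of select="text"/>
      <xsl:text>&#9;</xsl:text>
      <!-- One key lookup replaces the inner for-each; the empty separator
           keeps multiple matches concatenated, as in the nested loop. -->
      <xsl:value-of select="key('normByTimes', $times)/text" separator=""/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The idea is that each index is built once per document, so each original word costs a couple of key lookups instead of a full pass over the normalized words. Whether this is fast enough for the 4 to 12 MB files would still need to be checked.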
Thanks!