eXist-db / XSLT / Saxon collection() slow as molasses

Posted 2019-08-17 11:58

Question:

Following up on an earlier question, I arrived at one entirely unsatisfactory solution for accessing an eXist-db collection() from an XSLT 2.0 stylesheet invoked through eXist-db's XQuery transformation function:

The XSLT file declares a variable:

 <xsl:variable name="coll" select="collection('xmldb:exist:///db/apps/deheresi/data/collection_ms609.xml')"/>

This points to a catalog XML file that I created (per the Saxon documentation) in order to load the actual collection. It looks like this:

<collection stable="true">
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0001.xml"/>
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0002.xml"/>
  ...
  ...
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0709.xml"/>
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0710.xml"/>
</collection>

This allows the XSLT file to use a key that needs to search across all these files:

<xsl:key name="correspkey" match="tei:seg[@type='dep_event' and @corresp]" use="@corresp"/>

<xsl:variable name="correspvar" select="self::seg[@type='dep_event' and @corresp]/@corresp"/>

<xsl:value-of select="$coll/(key('correspkey',$correspvar) except $correspvar)/@id" separator=", "/>

As it stands, with 50 documents in the catalog I get a result in 2 minutes; with all 710 I get a Java GC error after 4 minutes.

I have set indexes on the relevant nodes in eXist-db, but this does nothing for performance. It seems to me that Saxon is working 'outside' eXist-db's optimisations, treating eXist-db as a simple file system.

(For what it's worth, setting href="/db/apps/deheresi/data/ms609_0001.xml" does not let Saxon see the documents.)

I suspect all of this is why eXist-db documentation on the subject is non-existent.

In short, I am looking for ways to run intensive searches across collections from within an XSLT 2.0 stylesheet invoked in eXist-db via XQuery transform().

If anything, I hope this post helps future searchers encountering the same problem.

Answer 1:

The general architectural principle is: try to move the searching closer to the data. In this case this means: use eXist to find the documents of interest, don't extract every possible candidate document from eXist and then ask Saxon to do the searching. Select the actual documents of interest in an eXist XQuery, and then pass the list of these documents to Saxon in a stylesheet parameter.
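As a rough illustration of that idea, here is a minimal XQuery sketch, not a tested implementation. It assumes the TEI namespace, the data collection path from the question, a hypothetical stylesheet path, and a hypothetical parameter name "docs". Because the parameters of transform:transform() carry string values, the sketch passes the URIs of the matching documents as a comma-separated string rather than the documents themselves; whether document-uri() returns a usable /db/... path in your setup should be checked.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Collection path taken from the question :)
let $data := collection("/db/apps/deheresi/data")

(: The document being transformed; this one is named in the question's catalog :)
let $doc := doc("/db/apps/deheresi/data/ms609_0001.xml")

(: Hypothetical stylesheet location :)
let $stylesheet := doc("/db/apps/deheresi/resources/xsl/example.xsl")

(: Let eXist do the search, so that its own indexes can be used :)
let $corresp := distinct-values($doc//tei:seg[@type = 'dep_event']/@corresp)
let $matches := $data//tei:seg[@type = 'dep_event'][@corresp = $corresp]

(: Keep only the URIs of documents that actually contain a match :)
let $uris := distinct-values(
    for $m in $matches
    return string(document-uri(root($m)))
)

(: "docs" is a hypothetical parameter name; the value is a plain string :)
let $params :=
    <parameters>
        <param name="docs" value="{string-join($uris, ',')}"/>
    </parameters>

return transform:transform($doc, $stylesheet, $params)
```

On the stylesheet side, one possibility is to declare <xsl:param name="docs"/> and build the variable with select="for $u in tokenize($docs, ',') return doc(concat('xmldb:exist://', $u))" in place of the collection() call, so that the existing key only has to search the handful of documents eXist has already selected instead of all 710.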