I have a corpus in TEI-XML which uses a 'master' corpus XML document that then contains, via xi:include
, thousands of other documents. Each of these documents themselves contain xi:include
s to master lists of named entities (people, places, etc linked by xml:id
s) . All of this works very well in XSLT (and in my IDE Oxygen for fast encoding).
I am now embarking on building a website using eXist-DB applications. I am rewriting everything directly in Xquery (to replace XSLT), and I have hit upon an unexpected decision. I am used to using xi:includes
to traverse the corpus and the various XMLs files. But reading the documentation of eXist DB, it seems that the encouraged practice is to use collections and query them directly, instead of navigating via xi:include
s. It also seems that eXist-DB does not support the full implementation of xi:includes
anyway and requires some work arounds?
I am looking for guidance as to best practices of eXist-DB/Xquery in this context.
Many thanks in advance.
Correct, eXist's XInclude implementation is focused on output (i.e., serialization) rather than on querying or indexing. As eXist's documentation page on XInclude states:
The XInclude processor is implemented as a filter in between the serializer's output event stream and the receiver... XInclude processing is therefore applied whenever eXist-db serializes an XML fragment, whether it's a document, the result of an XQuery or an XSLT stylesheet.
Thus, if you use XInclude to assemble your corpus and you want to query/traverse this corpus, you could do so by (1) writing a query to read your XInclude and following it like a map to find the component documents, (2) pre-serializing your data into a new document and then querying the resulting document directly, or (3) placing the documents into collections that facilitate the kinds of queries you want to do.
Depending on the size of those thousands of documents, traversing the xinclude when running xqueries tends to be slow and quite memory intensive. In my experience Joe's option 3 is usually the way to go.
Unlike with straight-up xslt, in exist-db you can define indexes. E.g. you have a <listPerson>
element as a wrapper for 1000s xincludes going to <person>
elements as root of their own document.
If you have defined and index for <person>
you can use e.g. ft:query()
to query the index directly, irrespective of where in the tree of sub-collections and documents the element is located. This tends to be orders of magnitude faster, compared to traversing the whole document starting at master, and resolving xincludes.
As for validation, you will need to decide if a full validation run of the whole expanded document is really always necessary. This requires some fiddling, but there isn't much general advice I can offer, without seeing the actual files and code.
You can find more information about indexing in exist in the documentation