Massive differences in speed with virtually identi

2019-08-09 05:23发布

问题:

I've been porting an SQL database over to exist over the last few weeks, and while I so far managed to get over all previous hindrances, I've now run into something for which neither the official documentation, searching online, nor common sense was able to help.

In short, I have a pretty big collection (about 90k entries, spread over 20 files), with most individual entries looking something like this (this is a massive simplification so I can get the point across):

<document>
 <document_id>Numerical Value</document_id>
 <page_id>Some other numerical value</page_id>
</document>

I then pass a value through php, lets call it $val The strange part is that when I run the standard query

$p in collection("/db/folder_location")//documentset/document[xs:integer(document_id) eq $val]

No matter what value I pass, it returns all the results in a matter of seconds. If I slightly modify it, however, making it:

$p in collection("/db/folder_location")//documentset/document[xs:integer(page_id) eq $val]

It either takes over 30 seconds to return the values or simply stays locked in a running query and never returns anything. Of all the 30 queries I already converted, this is the only time I ran into this problem and could not find a workaround.

回答1:

To address the query performance problem, I would suggest some changes to your query and/or the addition of a range index on the document_id and page_id elements.

Your query casts all document_id and page_id elements as xs:integer. This is an inefficient operation given a large dataset. Consider (a) removing this type casting, (b) reversing it (cast $val as xs:string), or (c) adding a range index on these two elements, with type="xs:integer". This latter option will let you remove the casting in your predicate (allowing you to change it to document[document_id eq $val] and document[page_id eq $val]), and the index should greatly speed the lookup.

To add a range index for your query, create a collection configuration document like this:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <range>
            <create qname="document_id" type="xs:integer"/>
            <create qname="page_id" type="xs:integer"/>
        </range>
    </index>
</collection>

If your data is in /db/folder_location, then store this document as collection.xconf in /db/system/config/db/folder_location. Then reindex your collection with xmldb:reindex("/db/system/config/db/folder_location"). As the documentation on range indexes states, with this index definition in place:

indexes will be used automatically for general or value comparisons as well as string functions like fn:contains, fn:starts-with, fn:ends-with.

For more on range indexes in eXist, see https://exist-db.org/exist/apps/doc/newrangeindex.xml. For query optimization techniques, see https://exist-db.org/exist/apps/doc/tuning.xml. For indexes in general in eXist, see https://exist-db.org/exist/apps/doc/indexing.xml.