I have my own dataset and I want to perform a federated query in SPARQL. Here is the query:
PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select * where {
  ?bioentity :hasMutatedVersionOf ?gene .
  ?gene :partOf wd:Q430258 .
  SERVICE <https://query.wikidata.org/sparql> {
    ?gene p:P644 ?statement ;
          wdt:P31 wd:Q7187 ;
          wdt:P703 wd:Q15978631 ;
          wdt:P1057 wd:Q430258 .
    ?statement ps:P644 ?start .
    ?statement pq:P659 wd:Q20966585 .
    ?gene p:P645 ?statement2 .
    ?statement2 ps:P645 ?end .
    ?statement2 pq:P659 wd:Q20966585 .
    FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
  }
}
I run the query via the GraphDB SPARQL interface, but it's really slow: it takes more than a minute to return 8 records. If I split the query into two parts, each of them is ridiculously fast.
Query #1
select * where {
  ?bioentity :hasMutatedVersionOf ?gene .
  ?gene :partOf wd:Q430258 .
}
56 records in 0.1s
Query #2
select * where {
  SERVICE <https://query.wikidata.org/sparql> {
    ?gene p:P644 ?statement ;
          wdt:P31 wd:Q7187 ;
          wdt:P703 wd:Q15978631 ;
          wdt:P1057 wd:Q430258 .
    ?statement ps:P644 ?start .
    ?statement pq:P659 wd:Q20966585 .
    ?gene p:P645 ?statement2 .
    ?statement2 ps:P645 ?end .
    ?statement2 pq:P659 wd:Q20966585 .
    FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
  }
}
158 records in 0.5s
Why is the federation so slow? Is there a way to optimize the performance?
Short answer
Just place your SERVICE part first, i.e. before ?bioentity :hasMutatedVersionOf ?gene .
Read a good article on the topic (e. g. chapter 5 of this book)
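For the first suggestion, the reordered query might look like the sketch below. It is simply the query from the question with the SERVICE block moved ahead of the local triple patterns; nothing else is changed, and whether the engine actually evaluates the patterns in this order depends on its optimizer.

```sparql
PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
  # Remote part first: it is evaluated once, with no local bindings pushed in.
  SERVICE <https://query.wikidata.org/sparql> {
    ?gene p:P644 ?statement ;
          wdt:P31 wd:Q7187 ;
          wdt:P703 wd:Q15978631 ;
          wdt:P1057 wd:Q430258 .
    ?statement ps:P644 ?start ;
               pq:P659 wd:Q20966585 .
    ?gene p:P645 ?statement2 .
    ?statement2 ps:P645 ?end ;
                pq:P659 wd:Q20966585 .
    FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
  }
  # Local part second: joined against the remote results on ?gene.
  ?bioentity :hasMutatedVersionOf ?gene .
  ?gene :partOf wd:Q430258 .
}
```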
Relevant quote from the aforementioned article:
Long answer
Example data
The total number of triples is 79. Please note that 26000000 is used instead of 21000000.
Query 1
Query 2
Performance
GraphDB behaviour
Executing Query 1, GraphDB performs 79 distinct GET requests to Wikidata¹:
These requests are queries of this kind:
Interestingly, on another machine GraphDB performs GET requests of another kind:
In this request, the Sesame protocol is used; these bindings in the URL are not part of the SPARQL 1.1 Protocol.
Perhaps the exact kind of request depends on the value of the internal reuse.vars.in.subselects parameter, whose default value is presumably different on Windows and on Linux.
Blazegraph behaviour
Executing Query 1, Blazegraph performs a single POST request to Wikidata²:
Conclusion
With federated queries, it is hard to create an effective execution plan, since the selectivity of remote patterns is unknown.
In your particular case, it should not matter much whether the results are joined locally or remotely, because both the local and the remote result sets are small. However, in GraphDB, joining results remotely is less efficient, because GraphDB does not reduce communication costs: it sends a separate request for each local binding.
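To make "reducing communication costs" concrete, here is a sketch of one way an engine can ship several local bindings to the remote endpoint in a single request: a VALUES clause, as discussed in the "Interplay of SERVICE and VALUES" section of the SPARQL 1.1 Federated Query specification. This is an illustration only. The two IRIs are hypothetical placeholders, and it is an assumption that a batched request would look like this; it is not a capture of what GraphDB or Blazegraph actually sends.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?gene ?start ?end WHERE {
  # Hypothetical placeholders standing in for the ?gene values found locally.
  # One request carries all of them instead of one request per binding.
  VALUES ?gene { <http://example.org/gene-A> <http://example.org/gene-B> }
  ?gene p:P644 ?statement ;
        wdt:P31 wd:Q7187 ;
        wdt:P703 wd:Q15978631 ;
        wdt:P1057 wd:Q430258 .
  ?statement ps:P644 ?start ;
             pq:P659 wd:Q20966585 .
  ?gene p:P645 ?statement2 .
  ?statement2 ps:P645 ?end ;
              pq:P659 wd:Q20966585 .
  FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
}
```

With N local bindings, the per-binding strategy costs N round trips, while a batched query of this shape costs one (or a few, if the bindings are sent in chunks).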
¹ For the screenshots, <http://query.wikidata.org/sparql> was used instead of <https://query.wikidata.org/sparql>.
² In Blazegraph, one might write hint:Query hint:optimizer "None" to ensure sequential evaluation.
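As a rough illustration of the second footnote, the hint could be placed at the top of the WHERE clause as sketched below. The hint: prefix declaration is an assumption for a standalone Blazegraph instance (some deployments predefine it), the query body is the one from the question, and this applies to Blazegraph only, not to GraphDB.

```sparql
PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT * WHERE {
  # Blazegraph-specific: disable join reordering so that the patterns
  # are evaluated in the order in which they are written.
  hint:Query hint:optimizer "None" .

  ?bioentity :hasMutatedVersionOf ?gene .
  ?gene :partOf wd:Q430258 .
  SERVICE <https://query.wikidata.org/sparql> {
    ?gene p:P644 ?statement ;
          wdt:P31 wd:Q7187 ;
          wdt:P703 wd:Q15978631 ;
          wdt:P1057 wd:Q430258 .
    ?statement ps:P644 ?start ;
               pq:P659 wd:Q20966585 .
    ?gene p:P645 ?statement2 .
    ?statement2 ps:P645 ?end ;
                pq:P659 wd:Q20966585 .
    FILTER (xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
  }
}
```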