I am trying to query the frequency of certain attributes in Wikidata, using SPARQL.
For example, to find out what the frequency of different values for gender is, I have the following query:
SELECT ?rid (COUNT(?rid) AS ?count)
WHERE {
  ?qid wdt:P21 ?rid .
  BIND(wd:Q5 AS ?human)
  ?qid wdt:P31 ?human .
}
GROUP BY ?rid
I get the following result:
wd:Q6581097 2752163
wd:Q6581072 562339
wd:Q1052281 223
wd:Q1097630 68
wd:Q2449503 67
wd:Q48270 36
wd:Q44148 8
wd:Q43445 4
t152990852 1
t152990762 1
t152990752 1
t152990635 1
t152775383 1
t152775370 1
t152775368 1
...
I have the following questions regarding this:
- What do those t152... values refer to?
- How can I ignore the tuples containing t152...? I tried FILTER ( !strstarts(str(?rid), "wd:") ) but it timed out.
- How can I count the distinct number of answers? I tried SELECT (COUNT(DISTINCT ?rid) AS ?count) with the above query, but again it timed out.
Values starting with t are "skolemized" unknown values (see, e.g., Q2423351 for a person of unknown sex or gender).
To improve performance, I suggest dividing your query into three parts:
All "normal" genders:
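The query for this part is not reproduced here, so the following is only a sketch of what it could look like. It assumes that "normal" means values that are themselves instances of wd:Q48264, the class used in the FILTER NOT EXISTS pattern mentioned further down. Variable names follow the question, and the PREFIX lines are included only for completeness, since the Wikidata Query Service predefines them.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Count humans per gender value, keeping only values typed as wd:Q48264.
# Assumption: "normal" = instance of wd:Q48264.
SELECT ?rid (COUNT(?qid) AS ?count)
WHERE {
  ?rid wdt:P31 wd:Q48264 .
  ?qid wdt:P21 ?rid ;
       wdt:P31 wd:Q5 .
}
GROUP BY ?rid
ORDER BY DESC(?count)
```

Stating the type of ?rid as a positive triple pattern restricts the candidate values up front instead of filtering them afterwards.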
Please note that, according to Wikidata, wd:Q746411 is a subclass of wd:Q48270, etc.
All "non-normal" genders:
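Again as a sketch rather than the original query: one way to get the remaining named values without FILTER NOT EXISTS is to aggregate first and then remove the "normal" values with MINUS. MINUS is a substitute chosen for this illustration, and whether it is actually faster on the public endpoint has not been verified. The wikibase:isSomeValue() function is the query service's test for unknown-value placeholders; on an endpoint that still exposes them as blank nodes, isBlank(?rid) would be the analogous check.

```sparql
PREFIX wd:       <http://www.wikidata.org/entity/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?rid ?count
WHERE {
  {
    # Aggregate first, so the steps below run on one row per value.
    SELECT ?rid (COUNT(?qid) AS ?count)
    WHERE {
      ?qid wdt:P31 wd:Q5 ;
           wdt:P21 ?rid .
    }
    GROUP BY ?rid
  }
  FILTER(!wikibase:isSomeValue(?rid))   # drop the "unknown value" placeholders
  MINUS { ?rid wdt:P31 wd:Q48264 }      # drop values typed as wd:Q48264 (assumed "normal")
}
ORDER BY DESC(?count)
```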
I do not use FILTER NOT EXISTS { ?rid wdt:P31 wd:Q48264 } here for performance reasons.
All (i.e. 1) "unknown" genders:
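A possible shape for this last part, assuming the endpoint supports wikibase:isSomeValue() (with isBlank(?rid) as the analogue where unknown values are still blank nodes). Since every unknown value is a distinct placeholder, the sketch collapses them into a single count instead of grouping by ?rid.

```sparql
PREFIX wd:       <http://www.wikidata.org/entity/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Count humans whose gender statement is an "unknown value" placeholder.
SELECT (COUNT(?qid) AS ?count)
WHERE {
  ?qid wdt:P31 wd:Q5 ;
       wdt:P21 ?rid .
  FILTER(wikibase:isSomeValue(?rid))
}
```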
In your case it makes little difference whether you count the wd:Q5 items with DISTINCT or without it, but the latter is preferable for performance reasons.