通过组出现最频繁选择值(select value that occurs most frequent

2019-09-30 17:09发布

我对医院的病人RDF数据,包括他们的出生日期。 经常有多个三元紧靠他们的出生日期 ,并且其中一些三元组可能是错误的 。 我的小组决定使用这个规则: 任何日期最常发生将暂时被认为是正确的 。 很显然,如何在我们的选择,外部SPARQL的任何编程语言做到这一点。

有可能在SPARQL聚合的聚集?

我看过类似的问题SPARQL选择计数器的MAX值 ,但我还没有。


鉴于这些三元组:

@prefix turbo: <http://example.org/ontologies/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/ontologies/b6be95364ec943af2ef4ab161c11c855>
  a <http://example.org/ontologies/StudyPartWithBBDonation> ;
  turbo:hasBirthDateO turbo:3950b2b6-f575-4074-b0e8-f9fa3378f3be, turbo:4250aafa-4b0c-4f73-92b6-7639f427b61d, turbo:a3e6676e-a214-4af4-b8ef-34a8e20170bf .

turbo:3950b2b6-f575-4074-b0e8-f9fa3378f3be turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:4250aafa-4b0c-4f73-92b6-7639f427b61d turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:a3e6676e-a214-4af4-b8ef-34a8e20170bf turbo:hasDateValue "1971-12-30"^^xsd:date .
turbo:6e200ca0d5150282787464a2bda55814
  a turbo:StudyPartWithBBDonation ;
  turbo:hasBirthDateO turbo:b09519f5-b123-40d5-bb4a-737ec9f8b9a8, turbo:06c56881-a6c7-4d1d-993b-add8862dffd7, turbo:12ef184d-c8d6-4d93-a558-a3ba47bb56ca .

turbo:b09519f5-b123-40d5-bb4a-737ec9f8b9a8 turbo:hasDateValue "2000-04-04"^^xsd:date .
turbo:06c56881-a6c7-4d1d-993b-add8862dffd7 turbo:hasDateValue "2000-04-04"^^xsd:date .
turbo:12ef184d-c8d6-4d93-a558-a3ba47bb56ca turbo:hasDateValue "2000-04-05"^^xsd:date .

该查询

PREFIX  turbo: <http://example.org/ontologies/>
SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount) 
  { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
           turbo:hasBirthDateO  ?dob .
    ?dob   turbo:hasDateValue   ?xsddate
  }
GROUP BY ?part ?xsddate

给出如下:

+----------------------------------------+------------------------+------------------+
|                  part                  |        xsddate         |    datecount     |
+----------------------------------------+------------------------+------------------+
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-05"^^xsd:date | "1"^^xsd:integer |
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |
+----------------------------------------+------------------------+------------------+

我只是想看到的是谁参与研究每个患者最高计数日期:

+----------------------------------------+------------------------+------------------+
|                  part                  |        xsddate         |    datecount     |
+----------------------------------------+------------------------+------------------+
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |
+----------------------------------------+------------------------+------------------+

我觉得我越来越接近这里。 现在,我需要得到的计数和最大在同一行计数!

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  turbo: <http://example.org/ontologies/>

SELECT  ?part ?xsddate ?datecount ?countmax
WHERE
  {   { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
        WHERE
          { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                   turbo:hasBirthDateO  ?dob .
            ?dob   turbo:hasDateValue   ?xsddate
          }
        GROUP BY ?part ?xsddate
      }
    UNION
      { SELECT  ?part (MAX(?datecount) AS ?countmax)
        WHERE
          { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
            WHERE
              { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                       turbo:hasBirthDateO  ?dob .
                ?dob   turbo:hasDateValue   ?xsddate
              }
            GROUP BY ?part ?xsddate
          }
        GROUP BY ?part
      }
  }

+----------------------------------------+------------------------+------------------+------------------+
|                  part                  |        xsddate         |    datecount     |     countmax     |
+----------------------------------------+------------------------+------------------+------------------+
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-05"^^xsd:date | "1"^^xsd:integer |                  |
| turbo:b6be95364ec943af2ef4ab161c11c855 | "1971-12-30"^^xsd:date | "3"^^xsd:integer |                  |
| turbo:6e200ca0d5150282787464a2bda55814 | "2000-04-04"^^xsd:date | "2"^^xsd:integer |                  |
| turbo:6e200ca0d5150282787464a2bda55814 |                        |                  | "2"^^xsd:integer |
| turbo:b6be95364ec943af2ef4ab161c11c855 |                        |                  | "3"^^xsd:integer |
+----------------------------------------+------------------------+------------------+------------------+

Answer 1:

从本质上讲,你只需要更换UNION. 在您的查询(或者你可以只删除此UNION ,如@AKSW已经在下面的评论中指出)。

在GraphDB,然而,你将收到一个错误:

变量?datecount在以前的投影已被使用。 绑定不是通过预测,因为芝麻2.8传播,因此这可能导致在查询中的逻辑错误。

因此,这种方法更改您的查询:

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  turbo: <http://example.org/ontologies/>

SELECT  ?part ?xsddate ?datecount_ ?countmax
WHERE
  {   { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount_)
        WHERE
          { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                   turbo:hasBirthDateO  ?dob .
            ?dob   turbo:hasDateValue   ?xsddate
          }
        GROUP BY ?part ?xsddate
      }
      .
      { SELECT  ?part (MAX(?datecount) AS ?countmax)
        WHERE
          { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
            WHERE
              { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                       turbo:hasBirthDateO  ?dob .
                ?dob   turbo:hasDateValue   ?xsddate
              }
            GROUP BY ?part ?xsddate
          }
        GROUP BY ?part
      }
  }

在Blazegraph,你可以使用命名的子查询 :

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  turbo: <http://example.org/ontologies/>

SELECT  ?part ?xsddate ?datecount ?countmax

WITH 
  { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
      WHERE
        { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                 turbo:hasBirthDateO  ?dob .
          ?dob   turbo:hasDateValue   ?xsddate
        }
       GROUP BY ?part ?xsddate
  } AS %sub

WHERE
  {  { SELECT ?part (MAX(?datecount) AS ?countmax)
       WHERE { INCLUDE %sub } GROUP BY ?part
     }
      INCLUDE %sub
  }


Answer 2:

我对斯坦尼斯的真棒答案阐述

  • 改名?datecount于一体的{}模式
  • 添加了一个过滤器
  • 插入的共识DOB到triplestore中的一个命名图

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  turbo: <http://example.org/ontologies/>

INSERT {
  GRAPH turbo:DOB_conclusions {
    ?part turbo:hasBirthDateO ?DOBconc .
    ?DOBconc turbo:hasDateValue ?xsddate .
    ?DOBconc turbo:conclusionated true .
    ?DOBconc rdf:type <http://www.ebi.ac.uk/efo/EFO_0004950> .
  }
}
WHERE
  { { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount)
      WHERE
        { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                 turbo:hasBirthDateO  ?dob .
          ?dob   turbo:hasDateValue   ?xsddate
        }
      GROUP BY ?part ?xsddate
    }
    .
    { SELECT  ?part (MAX(?datecount2) AS ?countmax)
      WHERE
        { SELECT  ?part ?xsddate (COUNT(?xsddate) AS ?datecount2)
          WHERE
            { ?part  rdf:type             turbo:StudyPartWithBBDonation ;
                     turbo:hasBirthDateO  ?dob .
              ?dob   turbo:hasDateValue   ?xsddate
            }
          GROUP BY ?part ?xsddate
        }
      GROUP BY ?part
    }
    FILTER ( ?datecount = ?countmax )
    BIND(uri(concat("http://transformunify.org/ontologies/", struuid())) AS ?DOBconc)
  }


文章来源: select value that occurs most frequently by group
标签: sparql