SPARQL full aggregation on a group aggregation

2019-03-03 12:41发布

问题:

I have an Ontology where users can use one of five predicates to express how much they like an item.

The Ontology contains specific predicates that have a property called hasSimilarityValue.

I am trying to do the following:

  1. Having a user let's say rs:ania
  2. Extract all the items that this user has rated before. (this is easy because the Ontology already contains triple from the user to the items)
  3. Extract similary items to the items that have been extracted in step 2 and calculate their similarities. (here we are using our own approach to calculate the similarites ). However the issue is: from step 2, we have many items the user has rated, from step there we are extracting and calculating similar items to these items that came from step 2. So, it is possible that an item in step 3 is similar to two (or more) items from step 2. Thus we end up with the following:

    user :ania rated item x1 user :ania rated item x2 item y is similar by y1 to x1 item y is similar by y2 to x2 item z is similar by z1 to x1

y1, y2, and z1 are values between 0 and 1

the thing is that we need to normalize these values to know the final similarities for item y and item z.

the normalization is simple, just group by the item and divide by the maximum number of items

so to know the similarity with y, i should do (y1+y2/2)

to know the similarity with z, i should do (z1/2)

my problem

as you see, i need to count the items and then know the max of this count

this is the query that calculates everything without the normalization part

select  ?s  (sum(?weight * ?factor) as ?similarity) ( sum(?weight * ?factor * ?ratings) as ?similarityWithRating) (count(distinct ?x) as ?countOfItemsUsedInDeterminingTheSimilarities) where {

    values (?user) { (rs:ania) }
    values (?ratingPredict) {(rs:ratedBy4Stars)  (rs:ratedBy5Stars)}
    ?user ?ratingPredict ?x.
    ?ratingPredict rs:hasRatingValue ?ratings.
    {
      ?s ?p ?o .
      ?x ?p ?o .
      bind(4/7 as ?weight)
    }
    union
    {
      ?s ?a ?b . ?b ?p ?o .
      ?x ?c ?d . ?d ?p ?o .
      bind(1/7 as ?weight)
    }
    ?p rs:hasSimilarityValue ?factor .
      filter (?s != ?x)
  }
  group by ?s

order by ?s

the result is:

now I need to divide each row by the maximum of the count column,

my proposed solution is to repeat the exact query twice, once to get the similarities and once to get the max and then join them and then do the divide (normalization). it is working but it is ugly, the performance will be disaster because i am repeating the same query twice. it is stupid solution and i would like to ask you guys for a better one please

here is my stupid solutions

 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rs: <http://www.musicontology.com/rs#>
PREFIX pdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

#select 
#?s   ?similarityWithRating (max(?countOfItemsUsedInDeterminingTheSimilarities) as ?maxNumberOfItemsUsedInDeterminingTheSimilarities)
#where {
 # {
select ?s ?similarity ?similarityWithRating ?countOfItemsUsedInDeterminingTheSimilarities ?maxCountOfItemsUsedInDeterminingTheSimilarities ?finalSimilarity where {
{
select  ?s  (sum(?weight * ?factor) as ?similarity) ( sum(?weight * ?factor * ?ratings) as ?similarityWithRating) (count(distinct ?x) as ?countOfItemsUsedInDeterminingTheSimilarities) where {

    values (?user) { (rs:ania) }
    values (?ratingPredict) {(rs:ratedBy4Stars)  (rs:ratedBy5Stars)}
    ?user ?ratingPredict ?x.
    ?ratingPredict rs:hasRatingValue ?ratings.
    {
      ?s ?p ?o .
      ?x ?p ?o .
      bind(4/7 as ?weight)
    }
    union
    {
      ?s ?a ?b . ?b ?p ?o .
      ?x ?c ?d . ?d ?p ?o .
      bind(1/7 as ?weight)
    }
    ?p rs:hasSimilarityValue ?factor .
      filter (?s != ?x)
  }
  group by ?s
#}

#}
#group by ?s 
order by ?s
} #end first part
{
select (Max(?countOfItemsUsedInDeterminingTheSimilarities) as ?maxCountOfItemsUsedInDeterminingTheSimilarities) where {
select  ?s  (sum(?weight * ?factor) as ?similarity) ( sum(?weight * ?factor * ?ratings) as ?similarityWithRating) (count(distinct ?x) as ?countOfItemsUsedInDeterminingTheSimilarities) where {

    values (?user) { (rs:ania) }
    values (?ratingPredict) {(rs:ratedBy4Stars)  (rs:ratedBy5Stars)}
    ?user ?ratingPredict ?x.
    ?ratingPredict rs:hasRatingValue ?ratings.
    {
      ?s ?p ?o .
      ?x ?p ?o .
      bind(4/7 as ?weight)
    }
    union
    {
      ?s ?a ?b . ?b ?p ?o .
      ?x ?c ?d . ?d ?p ?o .
      bind(1/7 as ?weight)
    }
    ?p rs:hasSimilarityValue ?factor .
      filter (?s != ?x)
  }
  group by ?s
#}

#}
#group by ?s 
order by ?s
}
}#end second part
  bind (?similarityWithRating/?maxCountOfItemsUsedInDeterminingTheSimilarities as ?finalSimilarity)
}
order by desc(?finalSimilarity)

Finally

Here is the data if you want to try it yourself. http://www.mediafire.com/view/r4qlu3uxijs4y30/musicontology

回答1:

It's really helpful if you can provide data to work with in these examples that's minimal. That means data that doesn't have stuff we don't need in order to solve the problem, and that is pretty much as simple as possible. I think that How to create a Minimal, Complete, and Verifiable example might be useful for your Stack Overflow questions.

Anyhow, here's some simple data that should be enough for us to work with. There are two users who have made some ratings, and some similarities in the data. Note that I made the similarities directed; you'd probably want them to be bidirectional, but that's not really the main part of this problem.

@prefix : <urn:ex:>

:user1 :rated :a , :b .

:user2 :rated :b , :c , :d .

:a :similarTo [ :piece :c ; :value 0.1 ] ,
              [ :piece :d ; :value 0.2 ] .

:b :similarTo [ :piece :d ; :value 0.3 ] ,
              [ :piece :e ; :value 0.4 ] .

:c :similarTo [ :piece :e ; :value 0.5 ] ,
              [ :piece :f ; :value 0.6 ] .

:d :similarTo [ :piece :f ; :value 0.7 ] ,
              [ :piece :g ; :value 0.8 ] .

Now, the query just needs to retrieve a user and the pieces that they've rated, along with similar pieces and the actual similarity values. Now, if you group by the user and the similar piece, you end up with a groups that have a single similar piece, a single user, and a bunch of rated pieces and their similarity to the similar piece. Since all the similarity ratings are in a fixed range (0,1), you can just average them to get overall similarity. In this query, I've also added a group_concat to show which rated pieces the similarity value is based on.

prefix : <urn:ex:>

select
    ?user
    (group_concat(?piece) as ?ratedPieces)
    ?similarPiece
    (avg(?similarity_) as ?similarity)
where {
  #-- Find ?pieces that ?user has rated.
  ?user :rated ?piece .

  #-- Find other pieces (?similarPiece) that are
  #-- similar to ?piece, along with the
  #-- similarity value (?similarity_)
  ?piece :similarTo [ :piece ?similarPiece ; :value ?similarity_ ] .
}
group by ?user ?similarPiece
------------------------------------------------------------
| user   | ratedPieces         | similarPiece | similarity |
============================================================
| :user1 | "urn:ex:a"          | :c           | 0.1        | ; a-c[0.1]
| :user1 | "urn:ex:b urn:ex:a" | :d           | 0.25       | ; b-d[0.3], a-d[0.2]
| :user1 | "urn:ex:b"          | :e           | 0.4        | ; b-e[0.4]
| :user2 | "urn:ex:b"          | :d           | 0.3        | ; b-d[0.3]
| :user2 | "urn:ex:c urn:ex:b" | :e           | 0.45       | ; c-e[0.5], b-e[0.4]
| :user2 | "urn:ex:d urn:ex:c" | :f           | 0.65       | ; d-f[0.7], c-f[0.6]
| :user2 | "urn:ex:d"          | :g           | 0.8        | ; d-g[0.8]
------------------------------------------------------------