convert sparql query to sql query in pyspark-dataf

2019-08-20 06:35发布

问题:

I have dataset that consist of three columns subject, predicate, and object

subject      predicate    object
   c1            B          V3
   c1            A          V3
   c1            T          V2
   c2            A          V2
   c2            A          V3
   c2            T          V1
   c2            B          V3
   c3            B          V3
   c3            A          V3
   c3            T          V1
   c4            A          V3
   c4            T          V1
   c5            B          V3
   c5            T          V2
   c6            B          V3
   c6            T          V1

I want to apply association mining rules on this data by using sql queries. I take this idea from this paper Association Rule Mining on Semantic data by using sparql(SAG algorithm)

first, the user has to specify T (target predicate) and minimum support,then query if this T is frequent or not:

SELECT ?pt ?ot (COUNT(*) AS ?Yent)
WHERE {?s ?pt ?ot.
FILTER (regex (str(?pt), 'T', 'i'».}
GROUP BY ?pt ?ot
HAVING (?Yent >= 2)   

I tried following code and I got same result:

q=mtcars1.select('s','p','o').where(mtcars1['p']=='T')
q1=q.groupBy('p','o').count()
q1.filter(q1['count']>=2).show()

result

+---+---+-----+
|  p|  o|count|
+---+---+-----+
|  T| V2|    2|
|  T| V1|    4|
+---+---+-----+

second query to calculate other predicates and objects if they are frequent:

q2=mtcars1.select('s','p','o').where(mtcars1['p']!='T')
q3=q2.groupBy('p','o').count()
q3.filter(q3['count']>=2).show() 

result

+---+---+-----+
|  p|  o|count|
+---+---+-----+
|  A| V3|    4|
|  B| V3|    5|
+---+---+-----+

in order to find rules between two above queries, we will scan dataset again and find if they are repeated together greater than or equal minimum support

SELECT ?pe ?oe ?pt ?ot (count(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
FILTER (regex (str(?pt), 'T','i'».
?s ?pe ?oe .
FILTER (!regex (str(?pe), 'T','i'».}
GROUP BY ?pe ?oe ?pt ?ot
HAVING (?supCNT >= I)
ORDER BY ?pt ?ot

I tried to store subject in list then join between items ,but this took long time, and this will take very long time if data is very large.

w=mtcars1.select('s','p' ,'o').where(mtcars1['p']=='T')
w1=w.groupBy('p','o').agg(collect_list('s')).show()

result

+---+---+----------------+
|  p|  o| collect_list(s)|
+---+---+----------------+
|  T| V2|        [c1, c5]|
|  T| V1|[c2, c3, c4, c6]|
+---+---+----------------+
w2=mtcars1.select('s','p' ,'o').where(mtcars1['p']!='T')
w3=w2.groupBy('p','o').agg(collect_list('s')).show()

result

+---+---+--------------------+
|  p|  o|     collect_list(s)|
+---+---+--------------------+
|  A| V3|    [c1, c2, c3, c4]|
|  B| V3|[c1, c2, c3, c5, c6]|
|  A| V2|                [c2]|
+---+---+--------------------+

join code

    from pyspark.sql.functions import *
w44=w1.alias("l")\
    .crossJoin(w3.alias("r"))\
    .select(
        f.col('l.p').alias('lp'),
        f.col('l.o').alias('lo'),
        f.col('r.p').alias('rp'),
        f.col('r.o').alias('ro'),
        intersection_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('TID'),
        intersection_length_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('len')
    )\
    .where(f.col('len') > 1)\
    .select(
        f.struct(f.struct('lp', 'lo'), f.struct('rp', 'ro')).alias('2-Itemset'), 
        'TID'
    )\
    .show()

result

+---------------+------------+
|      2-Itemset|         TID|
+---------------+------------+
|[[T,V2],[B,V3]]|    [c1, c5]|
|[[T,V1],[A,V3]]|[c3, c2, c4]|
|[[T,V1],[B,V3]]|[c3, c2, c6]|
+---------------+------------+

so,I have to re scan dataset again and find association rules between items, and re scan again to find again rules. following query is used to construct 3-factor set:

SELECT ?pel ?oel ?pe2 ?oe2 ?pt ?ot (eount(*) AS
?supCNT)
WHERE { ?s ?pt ?ot .
FILTER (regex (str(?pt), 'T','i'».
?s ?pel ?oel .
FILTER (!regex (str(?pel), 'T','i'».
FILTER (!regex (str(?pc2), 'T','i')&& !regex
(str(?pc2), str(?pcl),'i') ).}
GROUP BY ?pcl ?ocl ?pc2 ?oc2 ?pt ?ot
HAVING (?supCNT >=2)
ORDER BY ?pt ?ot

result for this query should be

{[(A, V3) (B, V3) (T, V1), 2]}

and we will repeat queries until no other rules between items can anyone help me how can make association rules by sql queries,where subject is used as ID ,predicate + object=items