Mysql vs Cypher vs Gremlin on union query

2019-06-09 20:40发布

问题:

I working on a project where I have to make real time recommandation based on filters. I decided to take a look on graph db and started to play with neo4j and compared it performance with mysql.

rows are about :

 "broadcast": 159844,
 "format": 5,
 "genre": 10,
 "program": 60495

the mysql the query look like :

select f.id, sum(weight) as total
from
(
    select program.id, 15 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
union all
    select program.id, 10 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    inner join genre ON program.genreId = genre.id
    where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
union all
    select program.id, 5 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    inner join genre ON program.genreId = genre.id
    inner join format on genre.formatId = format.id
    where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50

On my local machine the query perform in about 300ms. It may be acceptable but under 100ms would be better for real time processing.

I also have written with some help a thinkerpop3 query :

g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
   filter{true}
     .as('p', 'w').select('p', 'w').by('id').by(constant(15)),
   filter(out('programGenre').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(10)),
   filter(out('programGenre').out('genreFormat').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)

the query perform about 700ms !!

===== EDIT =====

since I wanted to display the profiling of the query and I got :

Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
Neo4jGraphStep([],vertex)                                         220513      220513       14135.788    68.52
HasStep([~label.eq(broadcast)])                                   159844      159844         391.087     1.90
VertexStep(IN,[programBroadcast],vertex)                          159825      159825         267.202     1.30
DedupGlobalStep                                                    60495       60495          95.848     0.46
UnionStep([[LambdaFilterStep(lambda)@[p, w], Pr...                 63247       63247        2008.553     9.74
  LambdaFilterStep(lambda)@[p, w]                                  60495       60495         194.406
  SelectStep([p, w],[value(id), [ConstantStep(1...                 60495       60495         487.158
    ConstantStep(15)                                               60495       60495          24.214
  EndStep                                                          60495       60495         110.575
  TraversalFilterStep([VertexStep(OUT,[programG...                  2070        2070         410.689
    VertexStep(OUT,[programGenre],vertex)                          22540       22540         191.158
    HasStep([id.eq(6)])                                                0           0         140.934
  SelectStep([p, w],[value(id), [ConstantStep(1...                  2070        2070          52.203
    ConstantStep(10)                                                2070        2070           0.654
  EndStep                                                           2070        2070          43.120
  TraversalFilterStep([VertexStep(OUT,[programG...                   682         682         443.347
    VertexStep(OUT,[programGenre],vertex)                          22540       22540         119.115
    VertexStep(OUT,[genreFormat],vertex)                           27510       27510         117.410
    HasStep([id.eq(1)])                                                0           0         133.517
  SelectStep([p, w],[value(id), [ConstantStep(5...                   682         682          43.247
    ConstantStep(5)                                                  682         682           0.217
  EndStep                                                            682         682          44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele...                     1           1        3583.249    17.37
  SelectOneStep(p)                                                 63247       63247          26.836
  SelectOneStep(w)                                                 63247       63247          81.623
  SumGlobalStep                                                    60495       60495        3107.593
  SelectOneStep(w)                                                     0           0           0.000
  SumGlobalStep                                                        0           0           0.000
UnfoldStep                                                         60495       60495          17.227     0.08
OrderGlobalStep([valueDecr])                                       60495       60495         114.439     0.55
FoldStep                                                               1           1          16.902     0.08
RangeLocalStep(0,10)                                                   1           1           0.081     0.00
SideEffectCapStep([~metrics])                                          1           1           0.215     0.00
- show quoted text -

this shows that 68% of the time happened to the g.V() that has no index!

suddenly I tried to find a way to have a single started point, so I did :

graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}

and now the query look like :

g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
    filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
   filter(out('programGenre').has('id', 4))
   .as('p', 'w').select('p', 'w').by('id').by(constant(10)),
   filter(out('programGenre').out('genreFormat').has('id', 4))
   .as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)

and the execution time is now 137ms !

===== END EDIT =====

Neo4j is slower than mysql in my case...

So I decided to make the query in cypher (thanks to this post) with this naive approach :

WITH [] AS res

MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)

OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res

OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res

OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res

UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight     

I'm about 68614ms !!

I'm very disappointed by graph db, but I don't understand why, I used indices on every properties and set java memory about 4g, and it's stuck compared to mysql, why ? graph db only for big data where mysql can't perform join ?