I working on a project where I have to make real time recommandation based on filters. I decided to take a look on graph db and started to play with neo4j and compared it performance with mysql.
rows are about :
"broadcast": 159844,
"format": 5,
"genre": 10,
"program": 60495
the mysql the query look like :
select f.id, sum(weight) as total
from
(
select program.id, 15 as weight
from broadcast
inner join program on broadcast.programId = program.id
where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 10 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 5 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
inner join format on genre.formatId = format.id
where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50
On my local machine the query perform in about 300ms. It may be acceptable but under 100ms would be better for real time processing.
I also have written with some help a thinkerpop3 query :
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)
the query perform about 700ms !!
===== EDIT =====
since I wanted to display the profiling of the query and I got :
Step Count Traversers Time (ms) % Dur
=============================================================================================================
Neo4jGraphStep([],vertex) 220513 220513 14135.788 68.52
HasStep([~label.eq(broadcast)]) 159844 159844 391.087 1.90
VertexStep(IN,[programBroadcast],vertex) 159825 159825 267.202 1.30
DedupGlobalStep 60495 60495 95.848 0.46
UnionStep([[LambdaFilterStep(lambda)@[p, w], Pr... 63247 63247 2008.553 9.74
LambdaFilterStep(lambda)@[p, w] 60495 60495 194.406
SelectStep([p, w],[value(id), [ConstantStep(1... 60495 60495 487.158
ConstantStep(15) 60495 60495 24.214
EndStep 60495 60495 110.575
TraversalFilterStep([VertexStep(OUT,[programG... 2070 2070 410.689
VertexStep(OUT,[programGenre],vertex) 22540 22540 191.158
HasStep([id.eq(6)]) 0 0 140.934
SelectStep([p, w],[value(id), [ConstantStep(1... 2070 2070 52.203
ConstantStep(10) 2070 2070 0.654
EndStep 2070 2070 43.120
TraversalFilterStep([VertexStep(OUT,[programG... 682 682 443.347
VertexStep(OUT,[programGenre],vertex) 22540 22540 119.115
VertexStep(OUT,[genreFormat],vertex) 27510 27510 117.410
HasStep([id.eq(1)]) 0 0 133.517
SelectStep([p, w],[value(id), [ConstantStep(5... 682 682 43.247
ConstantStep(5) 682 682 0.217
EndStep 682 682 44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele... 1 1 3583.249 17.37
SelectOneStep(p) 63247 63247 26.836
SelectOneStep(w) 63247 63247 81.623
SumGlobalStep 60495 60495 3107.593
SelectOneStep(w) 0 0 0.000
SumGlobalStep 0 0 0.000
UnfoldStep 60495 60495 17.227 0.08
OrderGlobalStep([valueDecr]) 60495 60495 114.439 0.55
FoldStep 1 1 16.902 0.08
RangeLocalStep(0,10) 1 1 0.081 0.00
SideEffectCapStep([~metrics]) 1 1 0.215 0.00
- show quoted text -
this shows that 68% of the time happened to the g.V() that has no index!
suddenly I tried to find a way to have a single started point, so I did :
graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}
and now the query look like :
g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)
and the execution time is now 137ms !
===== END EDIT =====
Neo4j is slower than mysql in my case...
So I decided to make the query in cypher (thanks to this post) with this naive approach :
WITH [] AS res
MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)
OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res
UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight
I'm about 68614ms !!
I'm very disappointed by graph db, but I don't understand why, I used indices on every properties and set java memory about 4g, and it's stuck compared to mysql, why ? graph db only for big data where mysql can't perform join ?