Neo4j query for shortest path stuck (Do not work)

2019-01-26 18:47发布

I made relation graph two relationship, like if A knows B then B knows A, Every node has unique Id and Name along with other properties.. So my graph looks like

enter image description here

if I trigger a simple query MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]-(p2) RETURN path it did not give any response and consumes 100% CPU and RAM of the machine.


UPDATED As I read though posts and from comments on this post I simplified the model and relationship. Now it ends up to enter image description here

Each relationship has different weights, to simplify consider horizontal connections weight 1, vertical weights 1 and diagonal relations have weights 1.5 In my database there are more than 85000 nodes and 0.3 Million relationships

Query with shortest path is not ends up to some result. It stuck in the processing and CPU goes to 100%

3条回答
神经病院院长
2楼-- · 2019-01-26 19:22

im afraid you wont be able to do much here. your graph is very specific, having a relation only to closest nodes. thats too bad cause neo4j is ok to play around the starting point +- few relations away, not over whole graph with each query

it means, once, you are 2 nodes away, the computational complexity raises up to:

8 relationships per node
distance 2
8 + 8^2

in general, the top complexity for a distance n is

O(8 + 8^n) //in case all affected nodes have 8 connections

you say, you got like ~80 000 of nodes.this means (correct me if im wrong), the longest distance of ~280 (from √80000). lets suppose your nodes

(p1:SearchableNode {name: "Ishaan"}), 
(p2:SearchableNode {name: "Garima"}),

to be only 140 hopes away. this will create a complexity of 8^140 = 10e126, im not sure if any computer in the world can handle this.

sure, not all nodes have 8 connections, only those "in the middle", in our example graph it will have ~500 000 relationships. you got like ~300 000, which is maybe 2 times less so lets supose the overal complexity for an average distance of 70 (out of 140 - a very relaxed bottom estimation) for nodes having 4 relationships in average (down from 8, 80 000 *4 = 320 000) to be

O(4 + 4^70) = ~10e42

one 1GHz CPU should be able to calculate this by:

-1000 000 per second
10e42 == 10e36 * 1 000 000 -> 10e36 seconds

lets supose we got a cluster of 100 10Ghz cpu serves, 1000 GHz in total. thats still 10e33 * 1 000 000 000 -> 10e33seconds

i would suggest to just keep away from AllshortestPaths, and look only for the first path available. using gremlin instead of cypher it is possible to implement own algorithms with some heuristics so actually you can cut down the time to maybe seconds or less.

exmaple: using one direction only = down to 10e16 seconds.

an example heuristic: check the id of the node, the higher the difference (subtraction value) between node2.id - node1.id, the higher the actual distance (considering the node creation order - nodes with similar ids to be close together). in that case you can either skip the query or just jump few relations away with something like MATCH n1-[:RELATED..5]->q-[:RELATED..*]->n2 (i forgot the syntax of defining exact relation count) which will (should) actually jump (instantly skip to) 5 distances away nodes which are closer to the n2 node = complexity down from 4^70 to 4^65. so if you can exactly calculate the distance from the nodes id, you can even match ... [:RELATED..65] ... which will cut the complexity to 4^5 and thats just matter of miliseconds for cpu.

its possible im completely wrong here. it has been already some time im our of school and would be nice to ask a mathematician (graph theory) to confirm this.

查看更多
迷人小祖宗
3楼-- · 2019-01-26 19:27
MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}),path = (p1)-[:NAVIGATE_TO*]->(p2) RETURN path

Or:

MATCH (p1:SearchableNode {name: "Ishaan"}), (p2:SearchableNode {name: "Garima"}), (p1)-[path:NAVIGATE_TO*]->(p2) RETURN path
查看更多
时光不老,我们不散
4楼-- · 2019-01-26 19:43

Let's consider what your query is doing:

MATCH (p1:SearchableNode {name: "Ishaan"}), 
      (p2:SearchableNode {name: "Garima"}),
      path = (p1)-[:NAVIGATE_TO*]-(p2) 
RETURN path

If you run this query in the console with EXPLAIN in front of it, the DB will give you its plan for how it will answer. When I did this, the query compiler warned me:

If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH

You have two issues going on with your query - first, you're assigning p1 and p2 independent of one another, possibly creating this cartesian product. The second issue is that because all of your links in your graph go both ways and you're asking for an undirected connection you're making the DB work twice as hard, because it could actually traverse what you're asking for either way. To make matters worse, because all of the links go both ways, you have many cycles in your graph, so as cypher explores the paths that it can take, many paths it will try will loop back around to where it started. This means that the query engine will spend a lot of time chasing its own tail.

You can probably immediately improve the query by doing this:

MATCH p=shortestPath((p1:SearchableNode {name:"Ishaan"})-[:NAVIGATE_TO*]->(p2:SearchableNode {name:"Garima"}))
RETURN p;

Two modifications here - p1 and p2 are bound to each other immediately, you don't separately match them. Second, notice the [:NAVIGATE_TO*]-> part, with that last arrow ->; we're matching the relationship ONE WAY ONLY. Since you have so many reflexive links in your graph, either way would work fine, but either way you choose you cut the work the DB has to do in half. :)

This may still perform not so great, because traversing that graph is still going to have a lot of cycles, which will send the DB chasing its tail trying to find the best path. In your modeling choice here, you usually shouldn't have relationships going both ways unless you need separate properties on each relationship. A relationship can be traversed in both directions, so it doesn't make sense to have two (one in each direction) unless the information that relationship is capturing is semantically different.

Often you'll find with query performance that you can do better by reformulating the query and thinking about it, but there's major interplay between graph modeling and overall performance. With the graph set up with so many bi-directional links, there will only be so much you can do to optimize path-finding.

查看更多
登录 后发表回答