I have a weighted graph 30k nodes 160k edges, no negative weights. I would like to compute all the shortest paths from all the nodes to the others. I think I cannot assume any particular heuristics to simplify the problem.
I tried to use this Dijkstra C implementation http://compprog.wordpress.com/2007/12/01/one-source-shortest-path-dijkstras-algorithm/, that is for a single shortest path problem, calling the function dijkstras() for all my 30 nodes. As you can imagine, it takes ages. At the moment I don't have the time to write and debug the code by myself, I have to compute this paths as soon as possible and store them in a database so I am looking for another faster solution ready to use, do you have any tips?
I have to run it on a recent macbook pro with 8GB ram and I would like to find a solution that takes not more than 24 hours to finish the computation.
Thanks a lot in advance!!
Eugenio
If you can modify the algorithm to be multi-threaded you might be able to finish it in less than 24hrs.
The first node may be taking more than 1 minute. However, the 15,000th node should only take half that time because you would have calculated the shortest paths to all of the previous nodes.
Bottleneck can be your data structure that you use storing paths. If you use too much storage you run out of cache and memory space very soon causing fast algorithm to run very slowly because it gains order of 100 (cache miss) or 10000+ (swapped pages) constant multiplier.
Because you have to store paths in database I suspect that might be easily a bottleneck. It is probably best to try to first generate paths into memory with very efficient storage mode like N bits per vertex where N == maximum number of edges per vertex. Then set a bit to for each edge that can be used to generate one of the shortest paths. After generating this path information you can run a recursive algorithm generating the path information to a format suitable for database storage.
Of course the most likely bottleneck is still the database. You want to think very carefully what format you use to store the information because insert, search and modifying large datasets in SQL database is very slow. Also using transaction to do database operations might be able to reduce the disk write overhead if database engine manages to path multiple insertions to a single disk write operation.
It can be even better to simple store results in memory cache and discard solutions when they are not actively needed any more. Then generate same results again if you happen to need them again. That means you would generate paths only on demand when you actually need them. Runtime for 30k nodes and 160k edges should be clearly below a second for single all shortest path run of Dijkstra.
For shortest path algorithms I have always chosen C++. There shouldn't be any reason why C implementation wouldn't be simple too but C++ offers reduced coding with STL containers that can be used in initial implementation and only later implement optimized queue algorithm if benchmarks and profiling shows that there is need to have something better than STL offers.
Does your graph have any special structure? Is the graph planar (or nearly so)?
I'd recommend not trying to store all shortest paths, a pretty dense encoding (30k^2 "where to go next" entries) will take up 7 gigs of memory.
What is the application? Are you sure that doing a bidirectional Dijkstra (or A*, if you have a heuristic) won't be fast enough when you need to find a particular shortest path?
I looked over the Dijkstra's algorithm link that you posted in the comments and I believe that it's the source of your inefficiency. Inside the inner Dijkstra's loop, it's using an extremely unoptimized approach to determine which node to explore next (a linear scan over every node at each step). The problematic code is in two spots. The first is this code, which tries to find the next node to operate on:
Because this code is nested inside of a loop that visits every node, the complexity (as mentioned in the link) is O(|V|2), where |V| is the number of nodes. In your case, since |V| is 30,000, there will be nine hundred million iterations of this loop overall. This is painfully slow (as you've seen), but there's no reason to have to do this much work.
Another trouble spot is here, which tries to find which edge in the graph should be used to reduce the cost of other nodes:
This scans over an entire row in the adjacency matrix looking for nodes to consider, which takes time O(n) irrespective of how many outgoing edges leave the node.
While you could try fixing up this version of Dijkstra's into a more optimized implementation, I think the correct option here is just to throw this code away and find a better implementation of Dijkstra's algorithm. For example, if you use the pseudocode from the Wikipedia article implemented with a binary heap, you can get Dijkstra's algorithm running in O(|E| log |V|). In your case, this value is just over two million, which is about 450 times faster than your current approach. That's a huge difference, and I'm willing to bet that with a better Dijkstra's implementation you'll end up getting the code completing in a substantially shorter time than before.
On top of this, you might want to consider running all the Dijkstra searches in parallel, as Jacob Eggers has pointed out. This cam get you an extra speed boost for each processor that you have. Combined with the above (and more critical) fix, this should probably give you a huge performance increase.
If you plan on running this algorithm on a much denser data set (one where the number of edges approaches |V|2 / log |V|), then you may want to consider switching to the Floyd-Warshall algorithm. Running Dijkstra's algorithm once per node (sometimes called Johnson's algorithm) takes time O(|V||E| log |V|) time, while using Floyd-Warshall takes O(|V|3) time. However, for the data set you've mentioned, the graph is sufficiently sparse that running multiple Dijkstra's instances should be fine.
Hope this helps!
How about the Floyd-Warshall algorithm?