I need to find connected components for a huge dataset. (Graph being Undirected)
One obvious choice is MapReduce. But i'm a newbie to MapReduce and am quiet short of time to pick it up and to code it myself.
I was just wondering if there is any existing API for the same since it is a very common problem in Social Network Analysis?
Or atleast if anyone is aware of any reliable(tried and tested) source using which atleast i can get started with the implementation myself?
Thanks
I don't really know if an API is available which has methods to find strongly connected components. But, I implemented the BFS algorithm to find distance from source node to all other nodes in the graph (the graph was a directed graph as big as 65 million nodes).
The idea was to explore the neighbors (distance of 1) for each node in one iteration and feeding the output of reduce back to map, until the distances converge. The map emits the shortest distances possible from each node, and reduce updated the node with the shortest distance from the list.
I would suggest to check this out. Also, this could help. These two links would give you the basic idea about graph algorithms in map reduce paradigm (if you are already not familiar). Essentially, you need to twist the algorithm to use DFS instead of BFS.
I blogged about it for myself:
http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html
But MapReduce isn't a good fit for these Graph analysis things. Better use BSP (bulk synchronous parallel) for that, Apache Hama provides a good graph API on top of Hadoop HDFS.
I've written a connected components algorithm with MapReduce here: (Mindist search)
https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce
Also a BSP version for Apache Hama can be found here:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java
The implementation isn't as difficult as in MapReduce and it is at least 10 times faster. If you're interested, checkout the latest version in TRUNK and visit our mailing list.
http://hama.apache.org/
http://apache.org/hama/mail-lists.html
You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient - and elegant - implementation using MapReduce. They also provide binaries, samples and a very detailed documentation.
The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.
EDIT: The official implementation is actually limited to 2.1 billions nodes (node id are stored as integers). I'm creating a fork on github (https://github.com/placeiq/pegasus) to share my patch and other enhancements (eg. Snappy compression).
It is a little old question but here is something you want to checkout. We implemented connected component using map-reduce on Spark platform.
https://github.com/kwartile/connected-component