Finding Connected Components using Hadoop/MapReduc

I need to find connected components for a huge dataset. (Graph being Undirected)

One obvious choice is MapReduce. But i'm a newbie to MapReduce and am quiet short of time to pick it up and to code it myself.

I was just wondering if there is any existing API for the same since it is a very common problem in Social Network Analysis?

Or atleast if anyone is aware of any reliable(tried and tested) source using which atleast i can get started with the implementation myself?

Thanks

标签： graph hadoop mapreduce social-networking graph-algorithm

4条回答

一夜七次

2楼-- · 2019-03-16 02:41

I don't really know if an API is available which has methods to find strongly connected components. But, I implemented the BFS algorithm to find distance from source node to all other nodes in the graph (the graph was a directed graph as big as 65 million nodes).

The idea was to explore the neighbors (distance of 1) for each node in one iteration and feeding the output of reduce back to map, until the distances converge. The map emits the shortest distances possible from each node, and reduce updated the node with the shortest distance from the list.

I would suggest to check this out. Also, this could help. These two links would give you the basic idea about graph algorithms in map reduce paradigm (if you are already not familiar). Essentially, you need to twist the algorithm to use DFS instead of BFS.

0人赞添加讨论(0) 举报

三岁会撩人

3楼-- · 2019-03-16 02:45

I blogged about it for myself:

http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html

But MapReduce isn't a good fit for these Graph analysis things. Better use BSP (bulk synchronous parallel) for that, Apache Hama provides a good graph API on top of Hadoop HDFS.

I've written a connected components algorithm with MapReduce here: (Mindist search)

https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce

Also a BSP version for Apache Hama can be found here:

https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java

The implementation isn't as difficult as in MapReduce and it is at least 10 times faster. If you're interested, checkout the latest version in TRUNK and visit our mailing list.

http://hama.apache.org/

http://apache.org/hama/mail-lists.html

0人赞添加讨论(0) 举报

兄弟一词,经得起流年.

4楼-- · 2019-03-16 02:50

You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient - and elegant - implementation using MapReduce. They also provide binaries, samples and a very detailed documentation.

The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations U Kang, Charalampos E. Tsourakakis, Christos Faloutsos In IEEE International Conference on Data Mining (ICDM 2009)

EDIT: The official implementation is actually limited to 2.1 billions nodes (node id are stored as integers). I'm creating a fork on github (https://github.com/placeiq/pegasus) to share my patch and other enhancements (eg. Snappy compression).

0人赞添加讨论(0) 举报

唯我独甜

5楼-- · 2019-03-16 02:58

It is a little old question but here is something you want to checkout. We implemented connected component using map-reduce on Spark platform.

https://github.com/kwartile/connected-component

0人赞添加讨论(0) 举报

Finding Connected Components using Hadoop/MapReduc

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间