I have implemented an unweighted random walk function for a graph that I built in Python using NetworkX. Below is the snippet of my program that deals with the random walk. Elsewhere in my program, I have a method that creates the graph, and I have a method that runs various custom graph testing methods that I've written. One of these graph testing methods picks two nodes at random from the graph and runs a random walk between them. Two things are calculated from this random walk: the hitting time (the number of links traversed from the starting point to the ending point) and the commute time (the number of links traversed from the starting point to the ending point and back again).
import random

def unweighted_random_walk(starting_point, ending_point, graph):
    '''
    starting_point: String that represents the starting point in the graph
    ending_point: String that represents the ending point in the graph
    graph: A NetworkX Graph object
    '''
    ## Begin the random walk
    current_point = starting_point
    current_point_neighbors = list(graph.neighbors(current_point))
    hitting_time = 0
    # Determine the hitting time to get from the starting point to the
    # ending point
    while current_point != ending_point:
        # Pick one of the edges out of the current node with equal probability
        current_point = random.choice(current_point_neighbors)
        current_point_neighbors = list(graph.neighbors(current_point))
        hitting_time += 1
    return hitting_time
My code for the random walk is pretty straightforward because I'm just picking a random neighbor at each step until the ending point is reached. However, this implementation is very slow when I try running several random walks (I think I will need to run a million at some point).
My question is: is there any way I can use Hadoop MapReduce to parallelize some of the operations going on here for this random walk? Is there a better way to do my random walk?
To address your question:
You need to address Ned's comment. He beat me to saying it. Explain your code; more on that later.
I cannot fathom a walking algorithm that could be run in parallel. By its very nature, a walk is a linear process: each step depends on the previous one. You cannot possibly know which node to jump to next without knowing the previous node (with the exception of the starting node). If your code actually represents a random walk whose choices are all independent of the previous ones, you need to explain that in your question.
Assuming each random walk is independent, however, you can run many random walks simultaneously. We call this scenario embarrassingly parallel, and that's a very lucky thing.
I have no idea why you want to use Hadoop, specifically, here. The first step should be, "Can I just write this as a basic program and use a qsub (or equivalent) script to farm out a bunch of runs of this program to the server?" If the answer is no, the next step is, "Can I use the multiprocessing module?" If you go with multiprocessing, you might want to take a look at Jesse Noller's multiprocessing presentation from PyCon 2009.
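For example, a minimal multiprocessing sketch might look like the following. The example graph, the number of walks, and the helper name run_one_walk are my own assumptions for illustration; the walk function is the one from your question.

import random
from multiprocessing import Pool

import networkx as nx

# Assumes unweighted_random_walk() from the question is defined in this module.

def run_one_walk(args):
    # Unpack one job: a graph plus a randomly chosen (start, end) pair.
    graph, start, end = args
    return unweighted_random_walk(start, end, graph)

if __name__ == '__main__':
    # A stand-in connected graph; substitute your own graph-building method.
    g = nx.barabasi_albert_graph(1000, 3)
    nodes = list(g.nodes())
    jobs = [(g, random.choice(nodes), random.choice(nodes)) for _ in range(1000)]
    with Pool() as pool:  # one worker per CPU core by default
        hitting_times = pool.map(run_one_walk, jobs)
    print(sum(hitting_times) / len(hitting_times))

Each job is independent, so the pool can spread them across however many cores you have; for a cluster, the same script is what you would farm out with qsub.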
Now, with regards to your particular code...
You need to explain what the nodes in your graph are. I'm confused about why you're treating them like a dictionary (calling .keys() on them). If they are dictionaries, tell us what the keys and values are. I hope you're not storing the neighbors as keys there, because NetworkX already gives you that via the Graph.neighbors() method. If you're storing the neighbors of the nodes in the nodes themselves, you have a misunderstanding of the NetworkX library. Let the graph do the work for you.
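As a rough illustration (the tiny graph here is made up), the graph object itself will hand you the neighbors:

import networkx as nx

g = nx.Graph()
g.add_edges_from([('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')])
# Ask the graph for adjacency instead of storing it on the nodes yourself.
print(list(g.neighbors('c')))   # -> ['a', 'b', 'd']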
You have the same logic twice in unweighted_random_walk(): once for the trip from the start node to the destination node, then again for the trip from the destination node back to the start node. Why? All you need is the logic for one direction. Call this function twice: once with the start and destination nodes as arguments to get the walk in one direction, then with the arguments swapped (destination first, then start) to get the walk in the other direction. You then have two independent calls, and can run them in parallel.
Don't use while True: (not just here, but in general). You should always indicate the actual condition under which to continue, e.g.,

while current_point != ending_point:
    ...
Don't return a string of the information; return the information directly, e.g.,

return hitting_time
Note that by following the advice above about calling the walk function once per direction, you only have to return the hitting time, and you can sum the hitting times of the there-call and the back-call to get the total commute time. Convenient, right?
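A rough sketch of that idea, using the walk function from your question (the name commute_time is mine):

def commute_time(start, end, graph):
    # Commute time = hitting time there + hitting time back,
    # computed as two independent walks that could run in parallel.
    there = unweighted_random_walk(start, end, graph)
    back = unweighted_random_walk(end, start, graph)
    return there + back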
See also
- the Disco project for a Python accessible MapReduce implementation
- Jesse Noller's presentation on concurrency and distributed computing from PyCon 2009
EDIT: Included links to Jesse Noller's presentations and to Disco.
I don't see how map-reduce can help you. It's used where you have a two-part operation: the first part is a calculation that can be performed independently on many different data elements, and the second part is somehow combining all those results. Perhaps there's a clever way to use map-reduce to help with this random walk, but I don't see it.
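To make that two-part shape concrete, here is a plain-Python caricature of it applied to independent walks (this only illustrates the pattern, not an actual Hadoop job; the function names are my own):

from functools import reduce

def map_step(pairs, graph):
    # "Map": one independent hitting-time computation per (start, end) pair.
    return [unweighted_random_walk(s, e, graph) for (s, e) in pairs]

def reduce_step(hitting_times):
    # "Reduce": combine the independent results, here into an average.
    return reduce(lambda a, b: a + b, hitting_times) / len(hitting_times)

The mapping part is trivially parallel, but nothing inside a single walk decomposes this way.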
Your random walk is completely random: it could end up with many loops, even hopping back and forth between the same two nodes before continuing on. Perhaps you want to somehow constrain it so you don't have so large a space to search?
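If you do want to constrain it, one possibility (purely my own illustration, not something from your question) is a non-backtracking walk that refuses to step straight back to the node it just left:

import random

def non_backtracking_walk(start, end, graph):
    # Like the original walk, but never immediately returns to the previous
    # node unless it is the only neighbor (a dead end).
    previous, current = None, start
    steps = 0
    while current != end:
        choices = [n for n in graph.neighbors(current) if n != previous]
        if not choices:
            choices = [previous]
        previous, current = current, random.choice(choices)
        steps += 1
    return steps

Keep in mind this changes the statistics of the walk, so the numbers are no longer plain hitting times.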
You don't have to actually perform the random walk if you use the formula detailed in this paper.
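I can't vouch for exactly what that paper does, but for a simple random walk the expected hitting times satisfy a standard linear system (h(target) = 0, and h(u) = 1 plus the average of h over u's neighbors), so they can be computed exactly without simulation. A sketch of that, assuming numpy and a connected NetworkX graph:

import networkx as nx
import numpy as np

def expected_hitting_times(graph, target):
    # Solve h(u) = 1 + (1/deg(u)) * sum of h(v) over neighbors v of u,
    # with h(target) = 0.
    nodes = [n for n in graph.nodes() if n != target]
    index = {n: i for i, n in enumerate(nodes)}
    A = np.eye(len(nodes))
    b = np.ones(len(nodes))
    for u in nodes:
        for v in graph.neighbors(u):
            if v != target:
                A[index[u], index[v]] -= 1.0 / graph.degree(u)
    h = np.linalg.solve(A, b)
    return {u: h[index[u]] for u in nodes}

The expected commute time between two nodes is then just the sum of the two expected hitting times.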