I have a simple twitter users graph with around 2 million nodes and 5 million edges. I'm trying to play around with Centrality. However, the calculation takes a really long time (more than an hour). I don't consider my graph to be super large so I'm guessing there might be something wrong with my code.
Here's my code.
%matplotlib inline
import pymongo
import networkx as nx
import time
import itertools
from multiprocessing import Pool
from pymongo import MongoClient
from sweepy.get_config import get_config
config = get_config()
MONGO_URL = config.get('MONGO_URL')
MONGO_PORT = config.get('MONGO_PORT')
MONGO_USERNAME = config.get('MONGO_USERNAME')
MONGO_PASSWORD = config.get('MONGO_PASSWORD')
client = MongoClient(MONGO_URL, int(MONGO_PORT))
db = client.tweets
db.authenticate(MONGO_USERNAME, MONGO_PASSWORD)
users = db.users
graph = nx.DiGraph()
for user in users.find():
graph.add_node(user['id_str'])
for friend_id in user['friends_ids']:
if not friend_id in graph:
graph.add_node(friend_id)
graph.add_edge(user['id_str'], friend_id)
The data is in MongoDB. Here's the sample of data.
{
"_id" : ObjectId("55e1e425dd232e5962bdfbdf"),
"id_str" : "246483486",
...
"friends_ids" : [
// a bunch of ids
]
}
I tried using betweenness centrality parallel to speed up but it's still super slow. https://networkx.github.io/documentation/latest/examples/advanced/parallel_betweenness.html
"""
Example of parallel implementation of betweenness centrality using the
multiprocessing module from Python Standard Library.
The function betweenness centrality accepts a bunch of nodes and computes
the contribution of those nodes to the betweenness centrality of the whole
network. Here we divide the network in chunks of nodes and we compute their
contribution to the betweenness centrality of the whole network.
"""
def chunks(l, n):
"""Divide a list of nodes `l` in `n` chunks"""
l_c = iter(l)
while 1:
x = tuple(itertools.islice(l_c, n))
if not x:
return
yield x
def _betmap(G_normalized_weight_sources_tuple):
"""Pool for multiprocess only accepts functions with one argument.
This function uses a tuple as its only argument. We use a named tuple for
python 3 compatibility, and then unpack it when we send it to
`betweenness_centrality_source`
"""
return nx.betweenness_centrality_source(*G_normalized_weight_sources_tuple)
def betweenness_centrality_parallel(G, processes=None):
"""Parallel betweenness centrality function"""
p = Pool(processes=processes)
node_divisor = len(p._pool)*4
node_chunks = list(chunks(G.nodes(), int(G.order()/node_divisor)))
num_chunks = len(node_chunks)
bt_sc = p.map(_betmap,
zip([G]*num_chunks,
[True]*num_chunks,
[None]*num_chunks,
node_chunks))
# Reduce the partial solutions
bt_c = bt_sc[0]
for bt in bt_sc[1:]:
for n in bt:
bt_c[n] += bt[n]
return bt_c
print("Computing betweenness centrality for:")
print(nx.info(graph))
start = time.time()
bt = betweenness_centrality_parallel(graph, 2)
print("\t\tTime: %.4F" % (time.time()-start))
print("\t\tBetweenness centrality for node 0: %.5f" % (bt[0]))
The import process from Mongodb to networkx is relatively fast, less than a minute.
TL/DR: Betweenness centrality is a very slow calculation, so you probably want to use an approximate measure by considering a subset of
myk
nodes wheremyk
is some number much less than the number of nodes in the network, but large enough to be statistically meaningful (NetworkX has an option for this:betweenness_centrality(G, k=myk)
.I'm not at all surprised it's taking a long time. Betweenness centrality is a slow calculation. The algorithm used by networkx is
O(VE)
whereV
is the number of vertices andE
the number of edges. In your caseVE = 10^13
. I expect importing the graph to takeO(V+E)
time, so if that is taking long enough that you can tell it's not instantaneous, thenO(VE)
is going to be painful.If a reduced network with 1% of the nodes and 1% of the edges (so 20,000 nodes and 50,000 edges) would take time X, then your desired calculation would take take 10000X. If X is one second, then the new calculation is close to 3 hours, which I think is incredibly optimistic (see my test below). So before you decide there's something wrong with your code, run it on some smaller networks and get an estimate of what the run time should be for your network.
A good alternative is to use an approximate measure. The standard betweenness measure considers every single pair of nodes and the paths between them. Networkx offers an alternative which uses a random sample of just
k
nodes and then finds shortest paths between thosek
nodes and all other nodes in the network. I think this should give a speedup to run inO(kE)
timeSo what you would use is
If you want to have bounds on how accurate your result is, you could do several calls with a smallish value of
k
, make sure that they are relatively close and then take the average result.Here's some of my quick testing of run time, with random graphs of (V,E)=(20,50); (200,500); and (2000,5000)
So on my computer it takes 15 seconds to handle a network that is 0.1% the size of yours. It would take about 15 million seconds to do a network the same size as yours. That's 1.5*10^7 seconds which is a little under half of pi*10^7 seconds. Since pi*10^7 seconds is an incredibly good approximation to the number of seconds in a year, this would take my computer about 6 months.
So you'll want to run with an approximate algorithm.