NoSQL Solution for Persisting Graphs at Scale

I'm hooked on using Python and NetworkX for analyzing graphs, and as I learn more I want to use more and more data (guess I'm becoming a data junkie :-). Eventually I think my NetworkX graph (which is stored as a dict of dicts) will exceed the memory on my system. I know I can probably just add more memory, but I was wondering if there is a way to integrate NetworkX with HBase or a similar solution instead?

I looked around and couldn't really find anything, and I couldn't find anything about using a simple MySQL backend either.

Is this possible? Does anything exist to allow for connectivity to some kind of persistent storage?

Thanks!

Update: I remember seeing this subject in 'Social Network Analysis for Startups'. The author talks about other storage methods (including HBase, S3, etc.) but does not show how to do this or whether it's possible.

Answer 1:

There are two general types of containers for storing graphs:

true graph databases: e.g., Neo4J, agamemnon, GraphDB, and AllegroGraph; these not only store a graph but also understand what a graph is, so you can query the database directly, e.g., how many nodes are on the shortest path between node X and node Y?
static graph containers: Twitter's MySQL-adapted FlockDB is the best-known example here. These DBs can store and retrieve graphs just fine, but to query the graph itself you first have to retrieve the graph from the DB and then use a library (e.g., Python's excellent NetworkX) to query it.

The redis-based graph container I discuss below is in the second category, though apparently redis is also well suited for containers in the first category, as evidenced by redis-graph, a remarkably small Python package for implementing a graph database in redis.

Redis will work beautifully here.

Redis is a heavy-duty, durable data store suitable for production use, yet it's also simple enough to use for command-line analysis.

Redis is different from other databases in that it has multiple data structure types; the one I would recommend here is the hash data type. Using this redis data structure allows you to closely mimic a "list of dictionaries", a conventional schema for storing graphs, in which each item in the list is a dictionary of edges keyed to the node from which those edges originate.
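
To make that schema concrete, here is a minimal sketch in plain Python (no redis needed) of the layout the hashes mimic: one dictionary of edge flags per node, keyed by node name. The three-node graph and the names `graph` and `neighbors` are made up for illustration.

```python
# One entry per node; each inner dict maps every node name to "1" (edge) or "0" (no edge).
# In redis, the hash stored under the key "n1" would hold the inner dict for "n1", and so on.
graph = {
    "n1": {"n1": "0", "n2": "1", "n3": "1"},
    "n2": {"n1": "1", "n2": "0", "n3": "0"},
    "n3": {"n1": "1", "n2": "0", "n3": "0"},
}


def neighbors(graph, node):
    """Return the sorted names of the nodes that `node` has an edge to."""
    return sorted(name for name, flag in graph[node].items() if int(flag) == 1)
```

The flags are strings here because that is how redis hands hash values back, so they have to be converted with int() before you test them.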

You first need to install redis and the Python client. The DeGizmo Blog has an excellent "up-and-running" tutorial which includes a step-by-step guide on installing both.

Once redis and its python client are installed, start a redis server, which you do like so:

cd to the directory in which you installed redis (/usr/local/bin on 'nix if you installed via make install); next
type redis-server at the shell prompt, then press enter

You should now see the server log tailing in your shell window.

>>> import numpy as NP
>>> import networkx as NX

>>> # start a redis client & connect to the server:
>>> from redis import StrictRedis as redis
>>> r1 = redis(db=1, host="localhost", port=6379)

In the snippet below, I have stored a four-node graph; each line calls hmset on the redis client and stores one node and the edges connected to that node ("0" => no edge, "1" => edge). (In practice, of course, you would abstract these repetitive calls in a function; here I'm showing each call because it's likely easier to understand that way.)

>>> r1.hmset("n1", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
      True

>>> r1.hmset("n2", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
      True

>>> r1.hmset("n3", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
      True

>>> r1.hmset("n4", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
      True

>>> # retrieve the edges for a given node:
>>> r1.hgetall("n2")
      {'n1': '1', 'n2': '0', 'n3': '0', 'n4': '1'}

Now that the graph is persisted, retrieve it from the redis DB as a NetworkX graph.

There are many ways to do this; below I do it in two steps:

extract the data from the redis database into an adjacency matrix, implemented as a 2D NumPy array; then
convert that directly to a NetworkX graph using a NetworkX built-in function:

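Reduced to code, these two steps are:

>>> # build one matrix row per node (list() so this also works on Python 3)
>>> AM = NP.array([list(map(int, r1.hgetall(node).values())) for node in r1.keys("*")])
>>> # now convert this adjacency matrix back to a networkx graph
>>> # (NetworkX 3.0 removed from_numpy_matrix; use NX.from_numpy_array there):
>>> G = NX.from_numpy_matrix(AM)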

>>> # verify that G in fact holds the original graph:
>>> type(G)
      <class 'networkx.classes.graph.Graph'>
>>> G.nodes()
      [0, 1, 2, 3]
>>> G.edges()
      [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
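
One caveat with the matrix-building step: redis does not promise any particular order for keys("*"), so the rows can come back in a different order than the columns. Below is a sketch of a stricter version in plain Python that fixes the node order by sorting the names and looks up every cell by name. The function name `rows_to_matrix` is hypothetical, and `rows` stands in for what you would collect from the server with hgetall, one dict per key.

```python
def rows_to_matrix(rows):
    """Turn {node: {neighbor: "0" or "1"}} into (sorted node names, 0/1 matrix)."""
    names = sorted(rows)
    # look up each cell by name; a missing field counts as "no edge"
    matrix = [[int(rows[src].get(dst, 0)) for dst in names] for src in names]
    return names, matrix


# same four-node graph as above, deliberately listed out of order
rows = {
    "n2": {"n4": "1", "n1": "1", "n2": "0", "n3": "0"},
    "n1": {"n1": "0", "n2": "1", "n3": "1", "n4": "1"},
    "n4": {"n1": "0", "n2": "1", "n3": "1", "n4": "1"},
    "n3": {"n1": "1", "n2": "0", "n3": "0", "n4": "1"},
}
names, matrix = rows_to_matrix(rows)
```

The matrix can then go through NP.array and into NetworkX as before, and names[i] tells you which node row i belongs to.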

When you end a redis session, you can shut down the server from the client like so:

>>> r1.shutdown()

Redis saves to disk just before it shuts down, so this is a good way to ensure all writes are persisted.

So where is the redis DB? It is stored in the default location with the default file name: dump.rdb, in the directory from which you started the server (the default dir setting in redis.conf is ./).

To change this, edit the redis.conf file (included with the redis source distribution); go to the line starting with:

# The filename where to dump the DB
dbfilename dump.rdb

Change dump.rdb to anything you wish, but leave the .rdb extension in place.

Next, to change the file path, find this line in redis.conf:

# Note that you must specify a directory here, not a file name

The line below that is the directory location for the redis database. Edit it so that it gives the location you want. Save your revisions and rename this file, but keep the .conf extension. You can store this config file anywhere you wish; just provide the full path and name of this custom config file on the same line when you start a redis server.

So the next time you start a redis server, you must do it like so (from the shell prompt):

$> cd /usr/local/bin    # or the directory in which you installed redis 

$> redis-server /path/to/redis.conf
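
To confirm which directory and file name a running server is actually using, you can ask it through the client. A small sketch, assuming redis-py's `config_get`, which returns a dict of setting names to values; `rdb_path` is a name I made up, and `FakeClient` with its settings is a hypothetical stand-in so the example runs without a server.

```python
import os


def rdb_path(client):
    """Return the full path of the dump file the connected server is configured to write."""
    directory = client.config_get("dir")["dir"]
    filename = client.config_get("dbfilename")["dbfilename"]
    return os.path.join(directory, filename)


class FakeClient(object):
    """Hypothetical stand-in that answers config_get with made-up settings."""

    settings = {"dir": "/var/lib/redis", "dbfilename": "graph.rdb"}

    def config_get(self, pattern):
        return {pattern: self.settings[pattern]}


path = rdb_path(FakeClient())
```

Against a real server you would call rdb_path(r1); if the answer is not what you put in your config file, the server was probably started without it.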

Finally, the Python Package Index lists a package specifically for implementing a graph database in redis. The package is called redis-graph, and I have not used it.

Answer 2:

There is a SQLite3-backed NetworkX implementation called Cloudlight: https://www.assembla.com/spaces/cloudlight/wiki/Tutorial

Answer 3:

I would be interested to see the best way of using the hard drive. In the past I have made multiple graphs and saved them as .dot files, then mixed some of them in memory. Not the best solution, though.

from random import random
import networkx as nx

def make_graph():
    G=nx.DiGraph()
    N=10
    #make a random graph
    for i in range(N):
        for j in range(i):
            if 4*random()<1:
                G.add_edge(i,j)

    nx.write_dot(G,"savedgraph.dot")
    return G

try:
    # reuse the graph saved by an earlier run, if one can be read
    G=nx.read_dot("savedgraph.dot")
except Exception:
    # This will fail if you don't use the same seed but have created the graph in the past.
    # You could use the Singleton design pattern here.
    G=make_graph()
print(G.adj)
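
One standard-library option for using the hard drive, not something the snippet above does: keep each node's neighbor list as its own record in a `shelve` file, so a single node can be read back without loading the whole graph. This is only a sketch under that assumption; the function names, the sample adjacency data, and the file name are made up.

```python
import os
import shelve
import tempfile


def save_adjacency(path, adjacency):
    """Write each node's neighbor list to its own record in a shelve file."""
    db = shelve.open(path)
    try:
        for node, neighbors in adjacency.items():
            db[str(node)] = sorted(neighbors)  # shelve keys must be strings
    finally:
        db.close()


def load_neighbors(path, node):
    """Read back the neighbors of one node without loading the other records."""
    db = shelve.open(path)
    try:
        return db[str(node)]
    finally:
        db.close()


# made-up directed graph: node -> list of nodes it points to
path = os.path.join(tempfile.mkdtemp(), "graph_shelf")
save_adjacency(path, {0: [], 1: [0], 2: [1, 0], 3: [1]})
```

This only persists and retrieves adjacency lists; any graph algorithm still needs the relevant part of the graph in memory, for example by adding the loaded edges to a NetworkX graph.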