I have been working on my graph/network problem, and I think I finally know what I want to do. Now that I am getting into the implementation, I am having trouble deciding which libraries to use. The graph itself is pretty simple: each node is labeled by a string, and each edge is undirected and carries a probability/correlation coefficient between the two nodes (variables). The operations that I want to perform on the graph are:
- Inserting new nodes/edges (fast)
- Finding all-pairs shortest paths (edge weight = 1/probability) and remembering the nodes along each path, probably with Johnson's algorithm
- Constructing the minimum weight Steiner tree for k specific vertices
  - Use Johnson's algorithm to build shortest paths
  - Iterating over the current nodes in the path p, find the shortest route to the remaining nodes in k
- Looking at the mean degree of the graph
- Evaluating the betweenness of the nodes
- Getting the clustering coefficients
- Finding the modularity of the graph
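The Steiner tree steps in the list above read like the classic shortest-path heuristic: start from one of the k vertices, then repeatedly attach the closest remaining one via its shortest path. Here is a minimal standard-library sketch of that idea, assuming non-negative edge weights (as 1/probability would give) and vertices numbered 0..n-1. The names `WeightedGraph`, `addUndirectedEdge`, and `steinerHeuristic` are made up for illustration, and it uses a multi-source Dijkstra search rather than Johnson's algorithm. It returns an approximate tree, not a guaranteed minimum one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <set>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, double> Arc;             // (neighbor, weight)
typedef std::vector<std::vector<Arc> > WeightedGraph;   // adjacency lists
typedef std::pair<std::size_t, std::size_t> TreeEdge;   // stored with first < second

// Hypothetical helper: add an undirected edge with a non-negative weight.
void addUndirectedEdge(WeightedGraph& g, std::size_t u, std::size_t v, double w) {
    assert(u < g.size() && v < g.size() && w >= 0.0);
    g[u].push_back(Arc(v, w));
    g[v].push_back(Arc(u, w));
}

// Grow a tree from terminals[0]; in each round, run Dijkstra from all tree
// vertices at once and attach the first remaining terminal that is reached.
std::set<TreeEdge> steinerHeuristic(const WeightedGraph& g,
                                    const std::vector<std::size_t>& terminals) {
    std::set<TreeEdge> tree;
    if (terminals.empty()) return tree;
    const std::size_t n = g.size();
    const std::size_t none = n;  // sentinel for "no vertex"
    const double inf = std::numeric_limits<double>::infinity();

    std::vector<bool> inTree(n, false);
    std::set<std::size_t> remaining(terminals.begin(), terminals.end());
    for (std::size_t t : remaining) assert(t < n);
    inTree[terminals[0]] = true;
    remaining.erase(terminals[0]);

    typedef std::pair<double, std::size_t> Item;  // (distance, vertex)
    while (!remaining.empty()) {
        std::vector<double> dist(n, inf);
        std::vector<std::size_t> pred(n, none);
        std::priority_queue<Item, std::vector<Item>, std::greater<Item> > pq;
        for (std::size_t v = 0; v < n; ++v)
            if (inTree[v]) { dist[v] = 0.0; pq.push(Item(0.0, v)); }

        std::size_t reached = none;
        while (!pq.empty()) {
            const Item top = pq.top();
            pq.pop();
            const std::size_t u = top.second;
            if (top.first > dist[u]) continue;               // stale queue entry
            if (remaining.count(u)) { reached = u; break; }  // closest terminal
            for (const Arc& a : g[u]) {
                const double nd = top.first + a.second;
                if (nd < dist[a.first]) {
                    dist[a.first] = nd;
                    pred[a.first] = u;
                    pq.push(Item(nd, a.first));
                }
            }
        }
        if (reached == none) break;  // the remaining terminals are unreachable

        // Walk back to the tree, adding every vertex and edge on the way.
        for (std::size_t v = reached; !inTree[v]; v = pred[v]) {
            tree.insert(TreeEdge(std::min(v, pred[v]), std::max(v, pred[v])));
            inTree[v] = true;
            remaining.erase(v);
        }
    }
    return tree;
}
```

Calling `steinerHeuristic(g, k)` with the k vertex indices returns the chosen edges as (smaller index, larger index) pairs. Rerunning the search every round keeps the sketch short; a precomputed all-pairs table would trade memory for speed.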
For many of these, I want to compare the result to the Erdős-Rényi model, testing against it as a null hypothesis. Also, being able to use the statistical mechanics definitions via a Markov field would be helpful, as then I could calculate correlations between two nodes that are not identical, and ask the graph questions about the entropy, etc. So a good mapping onto a Markov field library of some sort would be useful too.
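To make the null-hypothesis comparison concrete, here is a small standard-library sketch that computes the mean degree and the average local clustering coefficient, and samples an Erdős-Rényi G(n, p) graph to compare against. It ignores edge weights and assumes vertices are numbered 0..n-1. The names `AdjList`, `addEdge`, `meanDegree`, `averageClustering`, and `erdosRenyi` are made up for illustration and do not come from any of the libraries discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

typedef std::vector<std::set<std::size_t> > AdjList;  // neighbor sets, unweighted

// Hypothetical helper: add an undirected edge, ignoring self-loops.
// Using a set per vertex means duplicate edges are silently merged.
void addEdge(AdjList& g, std::size_t u, std::size_t v) {
    assert(u < g.size() && v < g.size());
    if (u == v) return;
    g[u].insert(v);
    g[v].insert(u);
}

// Mean degree: sum of all degrees divided by the number of vertices.
double meanDegree(const AdjList& g) {
    if (g.empty()) return 0.0;
    std::size_t total = 0;
    for (const auto& nbrs : g) total += nbrs.size();
    return static_cast<double>(total) / g.size();
}

// Average local clustering coefficient. For a vertex of degree k, the local
// value is (links among its neighbors) / (k*(k-1)/2); vertices with k < 2
// contribute 0 to the average.
double averageClustering(const AdjList& g) {
    if (g.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& nbrs : g) {
        const std::size_t k = nbrs.size();
        if (k < 2) continue;
        std::size_t links = 0;
        for (auto a = nbrs.begin(); a != nbrs.end(); ++a) {
            auto b = a;
            for (++b; b != nbrs.end(); ++b)
                if (g[*a].count(*b)) ++links;
        }
        sum += 2.0 * links / (k * (k - 1.0));
    }
    return sum / g.size();
}

// Sample G(n, p): every one of the n*(n-1)/2 possible edges is included
// independently with probability p.
AdjList erdosRenyi(std::size_t n, double p, std::mt19937& rng) {
    AdjList g(n);
    std::bernoulli_distribution coin(p);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (coin(rng)) addEdge(g, i, j);
    return g;
}
```

One way to use this is to set p = meanDegree(observed) / (n - 1) so that the random graphs match the observed density, sample many of them, and see where the observed clustering coefficient falls in the sampled distribution.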
The crux of the problem at the moment is that I am trying to find a C++ library to work in. I have taken a look at R, but I want something that is going to be more robust and faster. The three libraries that I am considering are:
- LEMON
  - Easy to use and install
  - Straightforward documentation
  - Has some of the functions I want already
  - Dynamically creating a graph from reading in a text file, and making sure there are no duplicate nodes, is a nightmare that I have not been able to figure out
- Boost Graph Library
  - Intractable, arcane definitions for objects, and how to use them
  - Documentation does not match what the code does, necessarily
  - Does have many of the algorithms that I want, as well as a very easy way to create a graph from a text file
- MultiThreaded Graph Library
  - Parallelism already incorporated
  - Reads easier than the BGL
  - Not as many functions
  - Still arcane
Further down the road, I envision the graph living on a distributed network, with distributed storage (Hadoop or something similar). I suspect that the whole graph will not fit into memory, so I will have to come up with a caching scheme that lets me look at parts of the graph.
What library would people suggest for the problem that I described? Would it be better to just use the BGL, and write my own functions? What about the multi-threaded version? Are there any libraries that lend themselves more readily to the type of work I want to do, especially the quantities I want to compute?
Thanks!
Edit 1: So I am seriously frustrated by the BGL. I have an adjacency-list graph, and I want to run my own version of Johnson's (or Floyd's; at this point I am not picky) algorithm on the graph and get back the distance matrix to look at. Except that I can't get it to work. Here is my full code so far:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <boost/graph/adjacency_list.hpp>
#include <boost/tokenizer.hpp>
#include <boost/tuple/tuple.hpp>

using namespace boost;

int main()
{
    //Read in the file
    std::ifstream datafile("stuff");
    if (!datafile)
    {
        std::cerr << "No Stuff file" << std::endl;
        return EXIT_FAILURE;
    }
    //Build the graph
    typedef adjacency_list < vecS, vecS, undirectedS, property < vertex_name_t,
        std::string >, property < edge_weight_t, double > > Graph;
    Graph g;
    //Build the two properties we want, string and double
    //Note, you have to nest properties for more
    typedef property_map < Graph, vertex_index_t >::type vertex_index_map_t;
    vertex_index_map_t vertex_index_map = get(vertex_index, g);
    typedef property_map < Graph, vertex_name_t >::type name_map_t;
    name_map_t name_map = get(vertex_name, g);
    typedef property_map < Graph, edge_weight_t >::type probability_map_t;
    probability_map_t probability = get(edge_weight, g);
    //Map of the vertices by string, used to avoid duplicate nodes
    typedef graph_traits < Graph >::vertex_descriptor Vertex;
    typedef std::map < std::string, Vertex > NameVertexMap;
    NameVertexMap AllNodes;
    //Load the file into the graph, one NodeA;correlation;NodeB record per line
    for (std::string line; std::getline(datafile, line);)
    {
        char_delimiters_separator < char > sep(false, "", ";");
        tokenizer <> line_toks(line, sep);
        //Skip blank or malformed lines so the iterator is never read past the end
        if (std::distance(line_toks.begin(), line_toks.end()) < 3)
            continue;
        tokenizer <>::iterator i = line_toks.begin();
        std::string conditionA = *i++;
        NameVertexMap::iterator pos;
        bool inserted;
        Vertex u, v;
        //Look up the first node by name, creating it only if it is new
        boost::tie(pos, inserted) = AllNodes.insert(std::make_pair(conditionA, Vertex()));
        if (inserted)
        {
            u = add_vertex(g);
            name_map[u] = conditionA;
            pos->second = u;
        }
        else
        {
            u = pos->second;
        }
        //Parse the correlation coefficient
        std::string correlation = *i++;
        std::istringstream incorrelation(correlation);
        double correlate;
        incorrelation >> correlate;
        //Look up the second node by name, creating it only if it is new
        boost::tie(pos, inserted) = AllNodes.insert(std::make_pair(*i, Vertex()));
        if (inserted)
        {
            v = add_vertex(g);
            name_map[v] = *i;
            pos->second = v;
        }
        else
        {
            v = pos->second;
        }
        //With vecS edge storage add_edge always succeeds, so a repeated
        //input line adds a parallel edge rather than being rejected
        graph_traits < Graph >::edge_descriptor e;
        boost::tie(e, inserted) = add_edge(u, v, g);
        if (inserted)
            probability[e] = 1.0/correlate;  //edge weight is 1/correlation
    }
    //Print every edge as ( index:name, weight, index:name )
    typedef boost::graph_traits<Graph>::edge_iterator edge_iter;
    std::pair<edge_iter, edge_iter> edgePair;
    Vertex u, v;
    for (edgePair = edges(g); edgePair.first != edgePair.second; ++edgePair.first)
    {
        u = source(*edgePair.first, g);
        v = target(*edgePair.first, g);
        std::cout << "( " << vertex_index_map[u] << ":" << name_map[u] << ", ";
        std::cout << probability[*edgePair.first] << ", ";
        std::cout << vertex_index_map[v] << ":" << name_map[v] << " )" << std::endl;
    }
    return EXIT_SUCCESS;
}
The input file has the format NodeA;correlation;NodeB. The code that I pasted above works, but I get into serious trouble when I try to add the johnson_all_pairs_shortest_paths functionality. What I really want is two things. First, a distance matrix D, which I cannot seem to construct correctly: I want a square matrix of doubles, double D[V][V] with V = num_vertices(g), but I get errors saying that I am not calling the function correctly. Second, the list of nodes that were taken along each path, similar to the path reconstruction that the wiki article has for Floyd's algorithm. Should I just roll my own algorithm(s) for this problem, since I can't figure out whether the functionality is there or not (not to mention how to make the function calls)? The documentation for the BGL is as obtuse as the implementation, so I don't really have any modern examples to go on.
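For reference, this is roughly what the roll-your-own route would look like: a minimal standard-library sketch of Floyd's algorithm with a "next" matrix for path reconstruction, along the lines of the wiki article. It does not use the BGL at all. It assumes vertices are numbered 0..V-1 (which matches the vertex indices that vecS gives), that the input matrix holds the edge weight where an edge exists and infinity elsewhere, and that there are no negative cycles. The names `DistMatrix`, `ShortestPaths`, `floydWarshall`, and `shortestPath` are made up for illustration. A vector of vectors stands in for double D[V][V], since an array sized by a runtime value is not standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<std::vector<double> > DistMatrix;

struct ShortestPaths {
    DistMatrix dist;                              // dist[i][j]: shortest distance
    std::vector<std::vector<std::size_t> > next;  // next[i][j]: vertex after i on the way to j
};

// Floyd's algorithm. The weight matrix w must be square, with infinity
// marking a missing edge. Runs in O(V^3) time and O(V^2) memory.
ShortestPaths floydWarshall(const DistMatrix& w) {
    const std::size_t n = w.size();
    const double inf = std::numeric_limits<double>::infinity();
    ShortestPaths sp;
    sp.dist = w;
    sp.next.assign(n, std::vector<std::size_t>(n, n));  // n means "no path"
    for (std::size_t i = 0; i < n; ++i) {
        assert(w[i].size() == n);
        for (std::size_t j = 0; j < n; ++j)
            if (w[i][j] < inf) sp.next[i][j] = j;        // direct edge
        sp.dist[i][i] = 0.0;
        sp.next[i][i] = i;
    }
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (sp.dist[i][k] + sp.dist[k][j] < sp.dist[i][j]) {
                    sp.dist[i][j] = sp.dist[i][k] + sp.dist[k][j];
                    sp.next[i][j] = sp.next[i][k];       // go towards k first
                }
    return sp;
}

// Rebuild the vertex sequence from u to v; empty if v is unreachable.
std::vector<std::size_t> shortestPath(const ShortestPaths& sp,
                                      std::size_t u, std::size_t v) {
    assert(u < sp.dist.size() && v < sp.dist.size());
    std::vector<std::size_t> path;
    if (sp.dist[u][v] == std::numeric_limits<double>::infinity()) return path;
    path.push_back(u);
    while (u != v) {
        u = sp.next[u][v];
        path.push_back(u);
    }
    return path;
}
```

To connect this with the code above, one would size the matrix with num_vertices(g), fill both [u][v] and [v][u] from each edge weight (keeping the smaller value if there are parallel edges), and map the indices in the returned path back to names through name_map.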