Correlation Network Implementation

2019-06-04 22:45发布

问题:

I have been working on my graph/network problem, and I think I finally know what I want to do. Now that I am getting into the implementation, I am having issues deciding what libraries to use. The graph itself is pretty simple, each node is labeled by a string, and each each is a probability/correlation coefficient between the two nodes(variables), and is undirected. The operations that I want to perform on the graph are:

  • Inserting new nodes/edges (fast)
  • Finding the all pairs shortest (1/probability) path, and remembering the nodes in the path - probably Johnson's algorithm
  • Constructing the minimum weight Steiner tree for k specific vertices
    • Use Johnson's algorithm to build shortest paths
    • Iterating over the current nodes in the path p, find the shortest route to the remaining nodes in k
  • Looking at the mean degree of the graph
  • Evaluating the betweenness of the nodes
  • Getting the clustering coefficients
  • Finding the modularity of the graph

For many of these, I want to compare the result to the Erdos-Renyi model, testing against it as a null hypothesis. Also, being able to be able to use the statistical mechanics definitions via a Markov Field would be helpful, as then I could calculate correlations between two nodes that are not identical, and ask the graph questions about the entropy, etc. So a good mapping onto a Markov field library of some sort would be useful too.

The crux of the problem at the moment is that I am trying to find a C++ library to work in. I have taken a look at R, but I want something that is going to be more robust and faster. The three libraries that I am considering are:

  • LEMON
    • Easy to use and install
    • Straightforward documentation
    • Has some of the functions I want already
    • Dynamically creating a graph from reading in a text file, and making sure there are no duplicate nodes, is a nightmare that I have not been able to figure out
  • Boost Graph Library
    • Intractable, arcane definitions for objects, and how to use them
    • Documentation does not match what the code does, necessarily
    • Does have many of the algorithms that I want, as well as a very easy way to create a graph from a text file
  • MultiThreaded Graph Library
    • Parallelism already incorporated
    • Reads easier than the BGL
    • Not as many functions
    • Still arcane

Further down the road, I envision the graph living on a distributed network, with distributed storage (hadoop or something). I suspect that the whole graph will not fit into memory, and so I will have to come up with a caching scenario to look at parts of the graph.

What library would people suggest for the problem that I described? Would it be better to just use the BGL, and write my own functions? What about the multi-threaded version? Are there any libraries that lend themselves more readily to the type of work I want to do, especially the quantities I want to compute?

Original Post

Thanks!

Edit1 So I am seriously frustrated by the BGL. I have an adjacency list graph, and I want to run my own version of the Johnson's (or Floyd's, at this point, I am not picky) on the graph, and return the Distance Matrix for me to look at. Except that I can't get it to work. Here is my full code implementation thus far:

using namespace boost;

int main()
{
//Read in the file
std::ifstream datafile("stuff");
if (!datafile)
{
    std::cerr << "No Stuff file" << std::endl;
    return EXIT_FAILURE;
}

//Build the graph
typedef adjacency_list < vecS, vecS, undirectedS, property < vertex_name_t,
    std::string >, property < edge_weight_t, double > > Graph;
Graph g;

//Build the two properties we want, string and double
//Note, you have to nest properties for more
typedef property_map< Graph, vertex_index_t >::type vertex_index_map_t;
vertex_index_map_t vertex_index_map = get(vertex_index, g);
typedef property_map < Graph, vertex_name_t >::type name_map_t;
name_map_t name_map = get(vertex_name, g);
typedef property_map < Graph, edge_weight_t >::type probability_map_t;
probability_map_t probability = get(edge_weight, g);

//Map of of the vertices by string
typedef graph_traits < Graph >::vertex_descriptor Vertex;
typedef std::map < std::string, Vertex > NameVertexMap;
NameVertexMap AllNodes;

//Load the file into the graph
for (std::string line; std::getline(datafile, line);)
{
    char_delimiters_separator < char >sep(false, "", ";");
    tokenizer <> line_toks(line, sep);
    tokenizer <>::iterator i = line_toks.begin();
    std::string conditionA = *i++;
    NameVertexMap::iterator pos;
    bool inserted;
    Vertex u, v;

    boost::tie(pos, inserted) = AllNodes.insert(std::make_pair(conditionA, Vertex()));
    if (inserted)
    {
        u = add_vertex(g);
        name_map[u] = conditionA;
        pos->second = u;
    }
    else
    {
        u = pos->second;
    }

    std::string correlation = *i++;
    std::istringstream incorrelation(correlation);
    double correlate;
    incorrelation >> correlate;

    boost::tie(pos, inserted) = AllNodes.insert(std::make_pair(*i, Vertex()));
    if (inserted) {
        v = add_vertex(g);
        name_map[v] = *i;
        pos->second = v;
    }
    else
    {
        v = pos->second;
    }

    graph_traits < Graph >::edge_descriptor e;
    boost::tie(e, inserted) = add_edge(u, v, g);
    if (inserted)
        probability[e] = 1.0/correlate;
}

typedef boost::graph_traits<Graph>::edge_iterator edge_iter;
std::pair<edge_iter, edge_iter> edgePair;
Vertex u, v;
for(edgePair = edges(g); edgePair.first != edgePair.second; ++edgePair.first)
{
    u = source(*edgePair.first, g);
    v = target(*edgePair.first, g);
    std::cout << "( " << vertex_index_map[u] << ":" << name_map[u] << ", ";
    std::cout << probability[*edgePair.first] << ", ";
    std::cout << vertex_index_map[v] << ":" << name_map[v] << " )" << std::endl;
}  
}

Where the input file is of the format NodeA;correlation;NodeB. The code that I pasted above works, but I get into serious trouble when I attempt to include the johnson_all_pairs_shortest_paths functionality. Really what I want is not only a DistanceMatrix D (which I cannot seem to construct correctly, I want it to be a square matrix of doubles double D[V][V], V = num_vertices(g), but it gives me back that I am not calling the function correctly), but also a list of the nodes that were taken along that path, similar to what the wiki article has for Floyd's Algorithm path reconstruction. Should I just make the attempt to roll my own algorithm(s) for this problem, since I can't figure out if the functionality is there or not (not to mention how to make the function calls)? The documentation for the BGL is as obtuse as the implementation, so I don't really have any modern examples to go on.