Find connected components of big Graph using Boost

2019-06-03 12:02发布

问题:

I have wrote a code for finding connected component of a very big Graph (80 million edges) but it doesn't work, When the edges number is close to 40 million it crashed.

int main(){
    using namespace boost;
    {
        int node1,node2;
        typedef adjacency_list <vecS, vecS, undirectedS> Graph;
        Graph G;
        std::ifstream infile("pairs.txt");
        std::string line;
        while (std::getline(infile,line))
        {
            std::istringstream iss(line);
            iss >> node1 >> node2;
            add_edge(node1, node2, G);}
            cout <<"writing file"<<endl;
            int  j = 0;
            ofstream out;
            out.open("connected_component.txt");
            std::vector<int> component(num_vertices(G));
            int num = connected_components(G, &component[0]);
            std::vector<int>::size_type i;
            for (i = 0; i != component.size(); ++i){
                out << i << "\t "<<component[i] <<endl;}
                out.close();
            }

Any idea of how can I do this with boost ? or change my Graph data type ?

回答1:

Using randomized graph data, I can run 40 million edges in about 37s (peaking at 4.4GiB of memory according to Massif).

/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test

Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms

real    0m37.339s
user    0m36.078s
sys 0m1.202s

1. Memory allocations

Note however, that I tweaked it by using a vector for the edge list, so that I can reserve the capacity required:

typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, 
         no_property, vecS> Graph;

This

  • enhanced load performance by removing reallocations
  • reduces heap fragmentation

2. Vertex id scaling

An important note, too, is that the storage requirements scale with the number of vertices. More specifically, they scale with the domain of vertices. E.g. loading a file like this:

1 7
2 7
5 6
4 9

will have vastly less memory requirements than

1 70000
2 70000
5 60000
4 90000

In fact, rerunning the above benchmark, with exactly the same input, but only the first line altered from

 47662 60203

into

 476624766 602036020

Results in the following output:

Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @  0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms

real    2m20.318s
user    1m30.224s
sys 0m49.821s

As you can see google's malloc implementation (from gperftools) even warns about exceptionally large allocations and indeed, it runs much slower. (Oh, and the memory usage becomes such that Massif doesn't make it anymore, but I've seen it hit 23GiB in htop).

Full Code

See it Live On Coliru running on 4000 edges:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>

#include <chrono>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);

    std::vector<std::pair<int, int> > as_read;

    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);

    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph
    G.m_edges.reserve(as_read.size());
    for(auto& pair : as_read)
        add_edge(pair.first,pair.second,G);

    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout <<"writing file"<<std::endl;

    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t "<< component[i] << std::endl; 
    }

    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}