I have written code for finding the connected components of a very big graph (80 million edges), but it doesn't work: when the number of edges gets close to 40 million, it crashes.
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <sstream>
#include <iostream>

int main() {
    using namespace boost;

    typedef adjacency_list<vecS, vecS, undirectedS> Graph;
    Graph G;

    std::ifstream infile("pairs.txt");
    std::string line;
    int node1, node2;
    while (std::getline(infile, line)) {
        std::istringstream iss(line);
        iss >> node1 >> node2;
        add_edge(node1, node2, G);
    }

    std::cout << "writing file" << std::endl;
    std::ofstream out("connected_component.txt");

    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    for (std::vector<int>::size_type i = 0; i != component.size(); ++i)
        out << i << "\t " << component[i] << std::endl;
    out.close();
}
Any idea how I can do this with Boost, or should I change my graph data type?
Using randomized graph data, I can run 40 million edges in about 37s (peaking at 4.4GiB of memory according to Massif).
/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test
Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms
real 0m37.339s
user 0m36.078s
sys 0m1.202s
1. Memory allocations
Note, however, that I tweaked it by using a vector for the edge list, so that I can reserve the required capacity:
typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property,
no_property, vecS> Graph;
This
- improves load performance by avoiding reallocations
- reduces heap fragmentation
2. Vertex id scaling
An important note, too, is that the storage requirements scale with the number of vertices; more specifically, with the domain of the vertex IDs. E.g. loading a file like this:
1 7
2 7
5 6
4 9
will have vastly less memory requirements than
1 70000
2 70000
5 60000
4 90000
In fact, rerunning the above benchmark with exactly the same input, but with only the first line altered from
47662 60203
into
476624766 602036020
results in the following output:
Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @ 0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms
real 2m20.318s
user 1m30.224s
sys 0m49.821s
As you can see, Google's malloc implementation (from gperftools) even warns about the exceptionally large allocations, and indeed it runs much slower. (Oh, and memory usage grows so large that Massif can no longer cope, but I've seen it hit 23 GiB in htop.)
Full Code
See it Live On Coliru running on 4000 edges:
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS,
                           no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);
    std::vector<std::pair<int, int> > as_read;
    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);
    std::cout << "Reading " << as_read.size() << " done in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph; reserving the edge list up front avoids reallocations
    // (m_edges is an implementation detail of adjacency_list)
    G.m_edges.reserve(as_read.size());
    for (auto& pair : as_read)
        add_edge(pair.first, pair.second, G);
    std::cout << "Building graph done in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);
    std::cout << "Algorithm done in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout << "writing file" << std::endl;
    std::ofstream out("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i)
        out << i << "\t " << component[i] << std::endl;
    out.close();
    std::cout << "Writing done in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}