graph-tool - reading edge lists from pandas datafr

Posted 2019-07-14 00:25

Question:

I'm starting to work with graph-tool, importing a list of edges from a pandas dataframe df like:

   node1  node2
0      1      2
1      2      3
2      1      4
3      3      1
4      4      3
5      1      5

So basically a list of directed edges. I'm importing them into graph-tool according to the tutorial with:

from graph_tool.all import *
import pandas as pd
# Read pandas dataframe
df = pd.read_csv('file.csv')
# Define Graph
g = Graph(directed=True)
# Add Edges
g.add_edge_list(df.values)

According to the documentation of add_edge_list(edge_list), edge_list may be an ndarray of shape (E,2), where E is the number of edges, and each row specifies a (source, target) pair.

Running the above code setting edge_list = df.values, and drawing the graph, I obtained:

which is not a representation of the original edge list in the dataframe. I then tried setting edge_list = df.values.tolist() with:

g.add_edge_list(df.values.tolist())

obtaining:

which is actually the correct one. Can anyone reproduce this? The problem is that I'm working with huge networks (~4*10^6 nodes), and I think the .tolist() method is going to waste a lot of memory in the process.
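To put rough numbers on the memory concern, the sketch below compares the buffer size of an (E, 2) integer array with an estimate of the list of lists that .tolist() builds. The array is synthetic, the helper name nested_list_bytes is made up for illustration, and the sys.getsizeof figures are specific to CPython, so treat the result as an order-of-magnitude indication only.

```python
import sys
import numpy as np

def nested_list_bytes(rows):
    """Rough upper bound on the size of a list of lists of ints (hypothetical helper)."""
    total = sys.getsizeof(rows)                       # outer list: pointers only
    for row in rows:
        total += sys.getsizeof(row)                   # each inner 2-element list
        total += sum(sys.getsizeof(x) for x in row)   # the int objects (small ints are shared)
    return total

# Synthetic stand-in for df.values: 100,000 edges stored as int64
edges = np.random.randint(0, 1000, size=(100_000, 2), dtype=np.int64)

array_bytes = edges.nbytes                            # 16 bytes per edge
list_bytes = nested_list_bytes(edges.tolist())

print(array_bytes, list_bytes)
```

The array needs 16 bytes per edge, while the nested list pays for one list object per row on top of the integers themselves.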

EDIT: add code for drawing the graph:

graph_draw(g, vertex_text=g.vertex_index, vertex_font_size=18, output_size=(200, 200), output="graph.png")

Answer 1:

That's really odd behavior. I've never used graph-tool (always networkx), so I can't reproduce it right now, but this might help.

According to the docs, edge_list can also be an iterator, which means you could try using a comprehension to create a generator out of df.values.tolist() and pass that as edge_list. I don't know whether it will speed things up on a network of your size (~4*10^6 nodes).

It'd look like this:

g.add_edge_list((item for item in df.values.tolist()))

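Example of size difference:

import numpy as np
import pandas as pd
import sys

df = pd.DataFrame(np.random.rand(1000, 2))  # example "large" dataframe

# sys.getsizeof is shallow: it counts the container itself, not the rows inside it
print(sys.getsizeof(df.values.tolist()))
print(sys.getsizeof((item for item in df.values.tolist())))

8072 # type list
80 # type generator

(Values as originally posted; exact sizes vary with the Python version.)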

Just an idea
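One caveat on the comparison above: sys.getsizeof reports only the shallow size of the outer list, and a generator wrapped around df.values.tolist() still builds the full list before yielding anything. A minimal sketch of a generator that yields rows straight from the dataframe instead is shown below. The name iter_edges is made up for illustration, and the graph-tool call is left as a comment on the assumption that add_edge_list accepts any iterable of (source, target) pairs.

```python
import pandas as pd

def iter_edges(df, source="node1", target="node2"):
    """Yield (source, target) tuples one row at a time (hypothetical helper)."""
    # itertuples(index=False, name=None) produces plain tuples without an index field
    for s, t in df[[source, target]].itertuples(index=False, name=None):
        yield int(s), int(t)

# Edge list from the question
df = pd.DataFrame({"node1": [1, 2, 1, 3, 4, 1],
                   "node2": [2, 3, 4, 1, 3, 5]})

edges = iter_edges(df)
first = next(edges)   # only one row has been consumed at this point
rest = list(edges)    # the remaining rows

# With graph-tool installed, the generator could be passed directly (untested assumption):
# g = Graph(directed=True)
# g.add_edge_list(iter_edges(df))
```

Whether this is faster than passing the array is a separate question; iterating row by row in Python is usually slower than handing over an ndarray, so this mainly trades speed for a smaller peak memory footprint.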



Answer 2:

I can't reproduce this. If I load the data frame from the csv file:

  node1,node2
  1,2
  2,3
  1,4
  3,1
  4,3
  1,5

I get your second figure after calling g.add_edge_list(df.values).
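One way to narrow down why the result differs between machines is to inspect the array that pandas hands over. The sketch below rebuilds the dataframe from the CSV text above and prints the shape, dtype, and memory-layout flags of df.values. The flags may differ between pandas versions, so no particular output is assumed here.

```python
import io
import pandas as pd

# Same CSV content as above, read from an in-memory string instead of file.csv
csv_text = "node1,node2\n1,2\n2,3\n1,4\n3,1\n4,3\n1,5\n"
df = pd.read_csv(io.StringIO(csv_text))

arr = df.values

# Shape should be (E, 2); dtype should be an integer type
print(arr.shape, arr.dtype)

# Memory layout: row-major (C) versus column-major (Fortran)
print(arr.flags["C_CONTIGUOUS"], arr.flags["F_CONTIGUOUS"])
```

Comparing these lines, together with the pandas, numpy, and graph-tool versions, between a setup that shows the problem and one that does not would indicate whether the array layout is involved.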



Answer 3:

This is old, but I noticed that the first graph is what you would get by reading pairs of vertices from the dataframe in column-major order. I imagine this is the source of the strange behavior.
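The column-major hypothesis can be illustrated with numpy alone. The sketch below stores the question's edge list in Fortran (column-major) order, walks the raw buffer as if it were row-major, and then shows that a C-contiguous copy restores the expected pairs. It simulates the suspected misreading and does not call graph-tool, so whether graph-tool actually read the buffer this way is an assumption.

```python
import numpy as np

# Edge list from the question, one (source, target) pair per row
edges = np.array([[1, 2], [2, 3], [1, 4], [3, 1], [4, 3], [1, 5]])

# Same values, but laid out column by column in memory (Fortran order)
edges_f = np.asfortranarray(edges)

# order="K" flattens in memory order, which here means column by column;
# pairing up consecutive values mimics a reader that assumes row-major layout
misread = edges_f.ravel(order="K").reshape(-1, 2)

# A C-contiguous copy puts each (source, target) pair next to each other again
edges_c = np.ascontiguousarray(edges_f)
reread = edges_c.ravel(order="K").reshape(-1, 2)

print(misread.tolist())
print(reread.tolist())
```

If the layout is indeed the cause, passing np.ascontiguousarray(df.values) to add_edge_list might sidestep the problem while copying only the array buffer rather than building a Python list.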