Big graph in memory

2019-09-20 10:38发布

问题:

I want to record all used ports within huge pcaps. There are 65535 ports available, and each port is able to talk each other port: 65535 x 65535 links in total

The matrix will be very sparse (many 0 entries). Additionally, I think the edges don't have to be directed, so Port1->Port2 may be added to Port2->Port1 (which reduces our amount of values to 65535 * 65536 / 2). How would you store this using python? In numpy? What will be the estimated amount of memory consumption for this?

Afterwards, I want to find the greatest sum for one port and pop() it (the whole row and column while). This means, i want to find e.g. that Port1 was used 500 times (100 times from Port2 to Port1, 300 times from Port3 to Port1, Port4 to Port1 100times)...

Graphically spoken, I want to have 65535 nodes that could be connected with each other. Then I want to find the node that has the highest sum of values on connected edges. Afterwards, I want to pop the node (and delete the corresponding edges, which will decrease the sum of other nodes).

Thanks!

回答1:

In Python, and depending on how sparse is sparse, a dict-of-dicts will handle this quite well.

connections = { ..., 8080: { 4545:17, 20151:3, ...}, ...}

If I have understood what you are doing correctly, then the count of connections to port p is

count = sum( connections[8080].values() )

removing port p is

del connections[p]
for conn in connections.values():  # edit, bug fixed.
    if p in conn: 
         del conn[p]

If you want to try to save memory by storing only half the pairs, then simplicity suffers greatly.



回答2:

Look into the adjacency list representation of Graph, it will most probably suits your needs.

However, a graph containing 65535 vertices is not that big. Even if you cannot represent it with a simple matrix.

The memory consumption is O(E+V) with V number of vertices (65535) and E number of edges (on a sparse graph, it has the same magnitude order than V).