I am calculating triad census
as follows for my undirected network
.
import networkx as nx
G = nx.Graph()
G.add_edges_from(
[('A', 'B'), ('A', 'C'), ('D', 'B'), ('E', 'C'), ('E', 'F'),
('B', 'H'), ('B', 'G'), ('B', 'F'), ('C', 'G')])
from itertools import combinations
#print(len(list(combinations(G.nodes, 3))))
triad_class = {}
for nodes in combinations(G.nodes, 3):
n_edges = G.subgraph(nodes).number_of_edges()
triad_class.setdefault(n_edges, []).append(nodes)
print(triad_class)
It works fine with small networks. However, now I have a bigger network with approximately 4000-8000 nodes. When I try to run my existing code with a network of 1000 nodes, it takes days to run. Is there a more efficient way of doing this?
My current network is mostly sparse. i.e. there are only few connections among the nodes. In that case, can I leave the unconnected nodes and do the computation first and later add the unconnceted nodes to the output?
I am also happy to get approximate answers without calculating every combination.
Example of triad census:
Triad census is dividing the triads (3 nodes) in to the four categories shown in the below figure.
For example consider the network below.
The triad census of the four classes are;
{3: [('A', 'B', 'C')],
2: [('A', 'B', 'D'), ('B', 'C', 'D'), ('B', 'D', 'E')],
1: [('A', 'B', 'E'), ('A', 'B', 'F'), ('A', 'B', 'G'), ('A', 'C', 'D'), ('A', 'C', 'E'), ('A', 'C', 'F'), ('A', 'C', 'G'), ('A', 'D', 'E'), ('A', 'F', 'G'), ('B', 'C', 'E'), ('B', 'C', 'F'), ('B', 'C', 'G'), ('B', 'D', 'F'), ('B', 'D', 'G'), ('B', 'F', 'G'), ('C', 'D', 'E'), ('C', 'F', 'G'), ('D', 'E', 'F'), ('D', 'E', 'G'), ('D', 'F', 'G'), ('E', 'F', 'G')],
0: [('A', 'D', 'F'), ('A', 'D', 'G'), ('A', 'E', 'F'), ('A', 'E', 'G'), ('B', 'E', 'F'), ('B', 'E', 'G'), ('C', 'D', 'F'), ('C', 'D', 'G'), ('C', 'E', 'F'), ('C', 'E', 'G')]}
I am happy to provide more details if needed.
EDIT:
I was able to resolve the memory error
by commenting the line #print(len(list(combinations(G.nodes, 3))))
as suggested in the answer. However, my program is still slow and takes days to run even with a network of 1000 nodes. I am looking for a more efficient way of doing this in python.
I am not limited to networkx
and happy to accept answers using other libraries and languages as well.
As always I am happy to provide more details as needed.
You program most probably crashes when you try to convert all combinations to a list:
print(len(list(combinations(G.nodes, 3))))
. Never do it becausecombinations
returns an iterator that consumes a little amount of memory, but list can easily eat gigabytes of memory.If you have sparse graph, it is more reasonable to find triads in connected components:
nx.connected_components(G)
Networkx has triads submodule but looks like it will not fit you. I already modified the networkx.algorithms.triads code to return triads, not their count. You can find it here. Note that it uses DiGraphs. If you want to use it with undirected graphs, you should convert them to directed first.
The idea is simple: Instead of working on the graph directly I use the adjacency matrix. I thought this would be more efficient, and it seems I was right.
In an adjacency matrix a 1 indicates there is an edge between the two nodes, for example the first row can be read as "There is a link between A and B as well as C"
From there I looked at your four types and found the following:
for type 3 there must be an edge between a N1 and N2, N1 and N3 and between N2 and N3. In the adjacency matrix we can find this by going over each row (where each row represents a node and its connections, this is N1) and find nodes it is connected to (that would be N2). Then, in the row of N2 we check all connected nodes (this is N3) and keep those where there is a positive entry in the row of N1. An example of this is "A, B, C", A has a connection to B. B has a connection to C, and A also has a connection to C
for type 2 it works almost identical to type 3. Except now we want to find a 0 for the N3 column in the row of N1. An example of this is "A, B, D". A has a connection to B, B has a 1 in the D column, but A does not.
for type 1 we just look at the row of N2 and find all columns for which both the N1 row and N2 row have a 0.
lastly, for type 0 look at all columns in the N1 row for which the entry is 0, and then check the rows for those, and find all the columns that have a 0 as well.
This code should work for you. For 1000 nodes it took me about 7 minutes (on a machine with a i7-8565U CPU) which is still relatively slow, but a far cry from the multiple days it currently takes you to run your solution. I have included the example from your pictures so you can verify the results. Your code produces a graph that is different from the example you show below by the way. The example graph in the code and the adjacency matrix both refer to the picture you have included.
The example with 1000 nodes uses networkx.generators.random_graphs.fast_gnp_random_graph. 1000 is the number of nodes, 0.1 is the probability for edge creation, and the seed is just for consistency. I have set the probability for edge creation because you mentioned your graph is sparse.
networkx.linalg.graphmatrix.adjacency_matrix: "If you want a pure Python adjacency matrix representation try networkx.convert.to_dict_of_dicts which will return a dictionary-of-dictionaries format that can be addressed as a sparse matrix."
The dictionary structure has
M
dictionaries (= rows) with up toM
dictionaries nested in them. Note that the nested dictionaries are empty so checking for the existence of the key in them is equivalent to checking for a 1 or 0 as described above.i think using list would be fast insertion than dictionary, as dictionary grows exponentially and will take more time.
Let's check the numbers. Let n be the number of vertices, e the number of edges.
0 triads are in O(n^3)
1 triads are in O(e * n)
2 + 3 triads are in O(e)
To get the 2 + 3 triads:
The next step depends on what the goal is. If you just need the number of 1 and 0 triads, then this is sufficient:
Explanation:
The 1 triads are all connected nodes + 1 unconnected node, so we get the number by computing the number of connected nodes + 1 other node, and subtract the cases where the other node is connected (2 and 3 triads)
The 0 triads is just all combinations of nodes minus the other triads.
If you need to actually list the triads, you are pretty much out of luck because no matter what you do, listing the 0 triads is in O(n^3) and will kill you once the graphs get bigger.
The above algo for 2 + 3 triads is in O(e * max(# neighbors)), the other parts are in O(e + n) for counting the nodes and edges. Much better than O (n^3) which you would need to explicitely list the 0 triads. Listing the 1 triads could still be done in O(e * n).