What is an efficient algorithm for counting the nu

2019-01-17 10:40发布

问题:

What is an efficient algorithm for counting the number of triangles in an undirected graph )(where a graph is a set of vertices and edges)? I've been searching Google and reading through my shelf of textbooks for a few hours each day for three days in a row.

This is for a homework assignment where I need such an algorithm, but developing it doesn't count for anything on the assignment. It is expected that we can simply find such an algorithm from outside resources, but I'm at the end of my rope.

For clarification, a triangle in a graph is a a cycle of length three. The trick is that it needs to work on vertex sets with at most 10,000 nodes.

I'm currently working in C#, but care more about the general approach towards solving this problem than code to copy and paste.

At the high level, my attempts thus far included:

  • A breadth first search that tracked all unique cycles of length three. This seemed like a fine idea to me, but I couldn't get it functional
  • A loop over all the nodes in the graph to see if three vertices shared an edge. This has far too slow of a running time for the larger data sets. O(n^3).

The algorithm itself is part of calculating the clustering coefficient.

回答1:

You will need depth first search. The algorithm will be:

1) For the current node ask all unvisited adjacent nodes

2) for each of those nodes run depth two check to see if a node at depth 2 is your current node from step one

3) mark current node as visited

4) on make each unvisited adjacent node your current node (1 by 1) and run the same algorithm



回答2:

Triangle counting is indeed difficult, and computationally costly. Maybe this is a good starting point for understanding why: Efficient Triangle Counting in Large Graphs via Degree-based Vertex Partitioning.

The appropriate loop should check for each of n nodes against each of their neighbours (n*(n-1)) and continue the cycle to see if n's neighbour's neighbour's neighbours is n: (n*(n-1))(n-1)(n-1), which is almost 10^16 for 10000 n. With a million nodes these loops get silly, but for your 10000 you should have no problem at all if you want to brute-force it :)

You mentioned that you code in C#, and graph (which is available for C) has an excellent algorithm for counting triangles written by Gabor Csardi. It counted 1.3 million triangles in my random graph of 10000 nodes and one million edges in 1.3 seconds on a five year old laptop :) Gabor Csardi would be the guy to ask :)

In terms of different programatical approaches, maybe you should look at the data in which you store your network. If stored in an adjacency matrix, the number of loops is fixed, but in an edge-list of a network of three edges, the number of loops is a multiple of three independent of the number of nodes. You could ask the edge-list for neighbours of a node without having to test every combination of i->j.

I wrote a pedagogical script in R to illustrate the approaches and measure the different algorithms' speed a very basic way. There are lots of speed-issues inherent to the use of R here (the edge-list version is completely swamped by too many edges), but I think the code example should get some ideas flowing on how to think about the speed of brute-forcing triangle-counts. This is in R, and not extremely neat, but well commented. I hope you can break the language barrier.

All the best.

# Counting triangles in a random graph using igraph and two different
# and almost equally stupid approaches looping through the 1) adjacency
# matrix and 2) the edge-list in R.

# Use igraph and these configs
library(igraph)
V <- 100
E <-  1700

# This is the random graph that we will use
g <- erdos.renyi.game(type="gnm", n=V, p=E, directed=FALSE, loops=FALSE)

# igraph has such a fast algorythm. Long live Gabor Csardi!
benchmark <- proc.time()["elapsed"]
       triangle.count <- sum(count_triangles(g)/3)
gabor.Csardi.benchmark <- proc.time()["elapsed"] - benchmark

# For not to big networks we can plot them to get a basic feel
if(length(V(g)) < 100){
       V(g)$size <- 5
       plot(g)
}

# The adjacency matrix approach will have to deal with a crazy
# ammount of looping through pairs of matrix combinations to
# find links:

# We'll loop through each node to check it's participation in triangles
# notice that a triangle ijk will be participated in by nodes
# i, j, and k, and that each of those nodes have two triangular counts.
# All in all the structures ijk, ikj, jik, jki, kij, kji are each counted
# but shall be returned as 1 triangle. We will therefore devide our
# search-result by 6 later on.

# Store our progess in this matrix to look at how we did later on
progress <- matrix(0, nrow=length(V(g)), ncol=8)

# Loop through all nodes to find triangles in an adjacency matrix
benchmark <- proc.time()["elapsed"] # Measure time for this loop
for(i in 1:length(V(g))){
       # Node i has connections to these nodes:
       i.neighbors <- as.vector( neighborhood(g, 1, nodes=i)[[1]] )
       i.neighbors <- setdiff(i.neighbors, c(i)) # i should not be part of its own neighborhood

       # for each i, tri is the number of triangles that i is involved in
       # using any j or any k. For a triangle {1,2,3}, tri will be 2 for
       # i==1, since i is part of both triangles {1,2,3} and {1,3,2}:
       tri <- 0

       for(j in i.neighbors)
       {
              # Node j has connections to these nodes:
              j.neighbors <- as.vector( neighborhood(g, 1, nodes=j)[[1]] )
              j.neighbors <- setdiff(j.neighbors, c(j)) # j should not be part of its own neighborhood

              # Were any of j's neighbors also a neighbor of i?
              k <- intersect(i.neighbors, j.neighbors)

              tri <- tri + length(k)
       }

       # Save our findings to the progress matrix
       progress[i,1] <- tri
       progress[i,7] <- proc.time()["elapsed"] - benchmark
}
progress[,2] <- sapply(1:length(progress[,1]), function(x) sum(progress[,1][1:x]))
progress[,3] <- round(progress[,2] / 6, digits=2)

# The edge-list approach uses a list of all edges in the network to loop through instead
# Here, I suppose, a lot of the extra speed could arise from R being better at looping
# with lapply() and at finding data in a data.frame that the brute-force loop above is.
el <- as.data.frame(as.matrix(get.edgelist(g, )))

# This is ugly. Make the edgelist contain all edges as both i->j and j->i. In
# the igraph object, they are only given as low i to high j by get.edgelist()
  el.rev <- data.frame(el[,2], el[,1])
  names(el) <- names(el.rev) <- c("i","j")
  el <- rbind(el, el.rev)

# these nodes are connected (we'd only need to bother abouth non isolates)
nodes <- sort(unique(c(el$i, el$j)))
tri <- 0

# Loop through connected nodes to find triangles in edge-list
benchmark <- proc.time()["elapsed"] # Measure time for this loop
for(i in nodes){
       i.neighbors <- el[el$i==i,]$j
       # i's neighbors are the $j:s of the edgelist where $i:s are i. 

       k.list <- unlist(lapply(i.neighbors, function(x) intersect(i.neighbors,el[el$i==x, ]$j)))
       # lists nodes that can be a k in an ijk-triangle for each of i's neighboring j:s
       # If 1 has neighbors 2 and 3, el[el$i==x, ]$j) will be first, the neighbors of 2 and then
       # the neighbors of 3. When intersected with the neighbors of i, k:s will be found. If
       # {1,2,3} is a triangle one k will be 3 for {i=1, j=2}, and another k will be 2 for {i=1, j=3}

       # k.list might be NULL
       tri.for.this.i <- (as.numeric(length(k.list)) / 2)
       # Here we devide by two since i can be in a triangle with j and k lik {ijk} and {ikj}
       # We will later have to devide by 3 more, since each triangle will be counted for
       # each node i that we loop through

       # Save the counting to the progress
       tri <- tri.for.this.i + tri
       progress[i,4] <- as.numeric(tri.for.this.i)
       mm <- c(mm, i)
       progress[i,8] <- proc.time()["elapsed"] - benchmark
}
progress[,5] <- sapply(1:length(progress[,4]), function(x) sum(progress[,4][1:x]))
progress[,6] <- round(progress[,5] / 3, digits=2)

# Fix the results into a usable format
results <- data.frame(c("igraph", "adjacency-loop", "edge-loop"),
                      c(triangle.count, max(progress[,3]), max(progress[,6])),
                      c(gabor.Csardi.benchmark, (max(progress[,7]) - min(progress[,7])), (max(progress[,8]) - min(progress[,8]))))
row.names(results) <- c("igraph", "Adjacensy-loop", "Edge-loop")
names(results) <- c("Routine", "Triangle count", "Execution-time")

# Now we have run two approaches of more or less the same thing.
# Not only should the igraph triangle.count, and the two loops
# be identical, but the progress of the two methods should too.
progress[,3] == progress[,6]
plot(progress[,6], type="l", col="blue")
lines(progress[,7], col="green")

# Look at the result:
View(results)


回答3:

Depends on how your graphs are represented.

If you have an adjacency matrix A the number of triangles should be tr(A^3)/6, in other words, 1/6 times the sum of the diagonal elements (the division takes care of the orientation and rotation).

IF you have adjacency lists just start at every node and perform a depth-3 search. Count how often you reach that node -> divide by 6 again.



回答4:

If you do not care about the exact number of triangles, there is a very simple streaming algorithm that provides an unbiased estimator. See for example here for an explanation.



回答5:

As cited here: https://math.stackexchange.com/a/117030

The fastest algorithm known for finding and counting triangles relies on fast matrix product and has an O(nω) time complexity, where ω<2.376 is the fast matrix product exponent. This approach however leads to a θ(n2) space complexity.