R: igraph, community detection, edge.betweenness m

2020-05-21 05:11发布

问题:

I've a relatively large graph with Vertices: 524 Edges: 1125, of real world transactions. The edges are directed and have a weight (inclusion is optional). I'm trying investigate the various communities within the graph and essentially need a method which:

-Calculates all possible communities

-Calculates the optimum number of communities

-Returns the members/# of members of each (optimum) community

So far I've managed to pull together the following code which plots a color coded graph corresponding to the various communities, however I've no idea how to control the number of communities (i.e plot the top 5 communities with the highest membership) or list the members of a particular community.

library(igraph)
edges <- read.csv('http://dl.dropbox.com/u/23776534/Facebook%20%5BEdges%5D.csv')
all<-graph.data.frame(edges)
summary(all)

all_eb <- edge.betweenness.community(all)
mods <- sapply(0:ecount(all), function(i) {
all2 <- delete.edges(all, all_eb$removed.edges[seq(length=i)])
cl <- clusters(all2)$membership
modularity(all, cl)
})


plot(mods, type="l")

all2<-delete.edges(all, all_eb$removed.edges[seq(length=which.max(mods)-1)])

V(all)$color=clusters(all2)$membership

all$layout <- layout.fruchterman.reingold(all,weight=V(all)$weigth)

plot(all, vertex.size=4, vertex.label=NA, vertex.frame.color="black", edge.color="grey",
edge.arrow.size=0.1,rescale=TRUE,vertex.label=NA, edge.width=.1,vertex.label.font=NA)

Because the edge betweenness method performed so poorly I tried again using the walktrap method:

all_wt<- walktrap.community(all, steps=6,modularity=TRUE,labels=TRUE)
all_wt_memb <- community.to.membership(all, all_wt$merges, steps=which.max(all_wt$modularity)-1)


colbar <- rainbow(20)
col_wt<- colbar[all_wt_memb$membership+1]

l <- layout.fruchterman.reingold(all, niter=100)
plot(all, layout=l, vertex.size=3, vertex.color=col_wt, vertex.label=NA,edge.arrow.size=0.01,
                    main="Walktrap Method")
all_wt_memb$csize
[1] 176  13 204  24   9 263  16   2   8   4  12   8   9  19  15   3   6   2   1

19 clusters - Much better!

Now say I had a "known cluster" with a list of its members and and wanted to check each of the observed clusters for the presence of members from the "known cluster". Returning the percentage of members found. Unable to finish the following??

list<-read.csv("http://dl.dropbox.com/u/23776534/knownlist.csv")
ength(all_wt_memb$csize) #19

for(i in 1:length(all_wt_memb$csize))
{

match((V(all)[all_wt_memb$membership== i]),list)

}  

回答1:

A couple of these questions can be discovered by closely looking at the documentation of the functions you're using. For instance, the documentation of clusters, in the "Values" section, describes what will be returned from the function, a couple of which answer your questions. Documentation aside, you can always use the str function to analyze the make-up of any particular object.

That being said, to get the members or numbers of members in a particular community, you can look at the membership object returned by the clusters function (which you're already using to assign color). So something like:

summary(clusters(all2)$membership)

would describe the IDs of the clusters that are being used. In the case of your sample data, it looks like you have clusters with the IDs ranging from 0 to 585, for 586 clusters in total. (Note that you won't be able to display those very accurately using the coloring scheme you're currently using.)

To determine the number of vertices in each cluster, you can look at the csize component also returned by clusters. In this case, it's a vector of length 586, storing one size for each cluster calculated. So you can use

clusters(all2)$csize

to get the list of sizes of your clusters. Be warned that your clusterIDs, as previously mentioned, start from 0 ("zero-indexed") whereas R vectors start from 1 ("one-indexed"), so you'll need to shift these indices by one. For instance, clusters(all2)$csize[5] returns the size of the cluster with the ID of 4.

To list the vertices in any cluster, you just want to find which IDs in the membership component previously mentioned match up to the cluster in question. So if I want to find the vertices in cluster #128 (there are 21 of these, according to clusters(all2)$csize[129]), I could use:

which(clusters(all2)$membership == 128)
length(which(clusters(all2)$membership == 128)) #21

and to retrieve the vertices in that cluster, I can use the V function and pass in the indices which I just computed which are a member of that cluster:

> V(all2)[clusters(all2)$membership == 128]
Vertex sequence:
 [1] "625591221 - Clare Clancy"           
 [2] "100000283016052 - Podge Mooney"     
 [3] "100000036003966 - Jennifer Cleary"  
 [4] "100000248002190 - Sarah Dowd"       
 [5] "100001269231766 - LirChild Surfwear"
 [6] "100000112732723 - Stephen Howard"   
 [7] "100000136545396 - Ciaran O Hanlon"  
 [8] "1666181940 - Evion Grizewald"       
 [9] "100000079324233 - Johanna Delaney"  
[10] "100000097126561 - Órlaith Murphy"   
[11] "100000130390840 - Julieann Evans"   
[12] "100000216769732 - Steffan Ashe"     
[13] "100000245018012 - Tom Feehan"       
[14] "100000004970313 - Rob Sheahan"      
[15] "1841747558 - Laura Comber"          
[16] "1846686377 - Karen Ni Fhailliun"    
[17] "100000312579635 - Anne Rutherford"  
[18] "100000572764945 - Lit Đ Jsociety"   
[19] "100003033618584 - Fall Ball"        
[20] "100000293776067 - James O'Sullivan" 
[21] "100000104657411 - David Conway"

That would cover the basic igraph questions you had. The other questions are more graph-theory related. I don't know of a way to supervise the number of clusters to be created using iGraph, but someone may be able to point you to a package which is able to do that. You may have more success posting that as a separate question, either here or in another venue.

Regarding your first points of wanting to iterate through all possible communities, I think you'll find that to be unfeasible for a graph of significant size. The number of possible arrangements of the membership vector for 5 different clusters would be 5^n, where n is the size of the graph. If you want to find "all possible communities", that number will actually be O(n^n), if my mental math is correct. Essentially, it would be impossible to calculate that exhaustively over any reasonably size network, even given massive computational resources. So I think you'll be better off using some sort of intelligence/optimization for determining the number of communities represented in your graph, as the clusters function does.



回答2:

In regard to "how to control the number of communities" in OPs question, I use the cut_at function on the communities to cut the resulting hierarchical structure into a desired number of groups. I hope someone can confirm that I am doing something sane. Namely, consider the following:

#Generate graph
adj.mat<- matrix(,nrow=200, ncol=200) #empty matrix
set.seed(2) 

##populate adjacency matrix
for(i in 1:200){adj.mat[i,sample(rep(1:200), runif(1,1,100))]<-1}
adj.mat[which(is.na(adj.mat))] <-0

for(i in 1:200){
  adj.mat[i,i]<-0
}

G<-graph.adjacency(adj.mat, mode='undirected')
plot(G, vertex.label=NA)

##Find clusters
walktrap.comms<- cluster_walktrap(G, steps=10)
max(walktrap.comms$membership) #43

  [1]  6 34 13  1 19 19  3  9 20 29 12 26  9 28  9  9  2 14 13 14 27  9 33 17 22 23 23 10 17 31  9 21  2  1
 [35] 33 23  3 26 22 29  4 16 24 22 25 31 23 23 13 30 35 27 25 15  6 14  9  2 16  7 23  4 18 10 10 22 27 27
 [69] 23 31 27 32 36  8 23  6 23 14 19 22 19 37 27  6 27 22  9 14  4 22 14 32 33 27 26 14 21 27 22 12 20  7
[103] 14 26 38 39 26  3 14 23 22 14 40  9  5 19 29 31 26 26  2 19  6  9  1  9 23  4 14 11  9 22 23 41 10 27
[137] 22 18 26 14  8 15 27 10  5 33 21 28 23 22 13  1 22 24 14 18  8  2 18  1 27 12 22 34 13 27  3  5 27 25
[171]  1 27 13 34  8 10 13  5 17 17 25  6 19 42 31 13 30 32 15 30  5 11  9 25  6 33 18 33 43 10

Now, note that there are 43 groups but we want coarser cuts hence, examine the dendrogram:

plot(as.hclust(walktrap.comms), label=F)

And cut based on it. I arbitrarily chose 6 cuts but nevertheless, you now have coarser clusters

cut_at(walktrap.comms, no=6)

  [1] 4 2 5 4 5 5 3 5 3 4 3 5 5 3 5 5 3 1 5 1 1 5 1 6 1 1 1 4 6 5 5 2 3 4 1 1 3 5 1 4 6 6 3 1 5 5 1 1 5 4 3 1
 [53] 5 2 4 1 5 3 6 3 1 6 6 4 4 1 1 1 1 5 1 4 3 3 1 4 1 1 5 1 5 2 1 4 1 1 5 1 6 1 1 4 1 1 5 1 2 1 1 3 3 3 1 5
[105] 3 3 5 3 1 1 1 1 3 5 2 5 4 5 5 5 3 5 4 5 4 5 1 6 1 3 5 1 1 1 4 1 1 6 5 1 3 2 1 4 2 1 2 3 1 1 5 4 1 3 1 6
[157] 3 3 6 4 1 3 1 2 5 1 3 2 1 5 4 1 5 2 3 4 5 2 6 6 5 4 5 3 5 5 4 4 2 4 2 3 5 5 4 1 6 1 2 4