I've got an input file with a list of ~50000 clusters and presence of a number of factors in each of them (~10 million entries in total), see a smaller example below:
set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)
With a bit of help from another question, I got it to produce a piechart for co-occurrence of factors like this:
counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse='+'))))
pie(counts[counts>1])
But now I would like to have a venn diagram for the co-occurrence of factors. Ideally, also in a way that can take a threshold for the minimum count for each factor. For example, a venn diagram for the different factors so that each one of them has to be present n>10 in each cluster to be taken into account.
I've tried to find a way to produce the table counts with aggregate, but couldn't make it work.
I've provided two solutions, using two different packages with Venn diagram capabilities. As you expected, both involve an initial step using the aggregate()
function.
I tend to prefer the results from the venneuler
package. It's default label positions aren't ideal, but you could adjust them by having a look at the associated plot
method (possibly using locator()
to select the coordinates).
Solution the 1st:
One possibility is to use venneuler()
in the venneuler
package to draw your Venn diagram.
library(venneuler)
## Modify the "factor" column, by renaming it and converting
## it to a character vector.
levels(data$factor) <- c("a", "b", "c")
data$factor <- as.character(data$factor)
## FUN is an anonymous function that determines which letters are present
## 2 or more times in the cluster and then pastes them together into
## strings of a form that venneuler() expects.
##
inter <- aggregate(factor ~ cluster, data=data,
FUN = function(X) {
tab <- table(X)
names <- names(tab[tab>=2])
paste(sort(names), collapse="&")
})
## Count how many clusters contain each combination of letters
counts <- table(inter$factor)
counts <- counts[names(counts)!=""] # To remove groups with <2 of any letter
# a a&b a&b&c a&c b b&c c
# 19 13 12 14 13 9 12
## Convert to proportions for venneuler()
ps <- counts/sum(counts)
## Calculate the Venn diagram
vd <- venneuler(c(a=ps[["a"]], b = ps[["b"]], c = ps[["c"]],
"a&b" = ps[["a&b"]],
"a&c" = ps[["a&c"]],
"b&c" = ps[["b&c"]],
"a&b&c" = ps[["a&b&c"]]))
## Plot it!
plot(vd)
A few notes about choices I made in writing this code:
I've changed the names of factors from "factor-a"
to "a"
. You can obviously change that back.
I've only required each factor to be present >=2 times (instead of >10) to be counted within each cluster. (That was to demonstrate the code with this small subset of your data.)
If you take a look at the intermediate object counts
, you'll see that it contains an initial unnamed element. That element is the number of clusters that contain fewer than 2 of any letter. You can decide better than I whether or not you want to include those in the calculation of the subsequent ps
('proportions') object.
Solution the 2nd:
Another possibility is to employ vennCounts()
and vennDiagram()
in the Bioconductor package limma
. To download the package, follow the instructions here. Unlike the venneuler
solution above, the overlap in the resultant diagram is not proportional to the actual degree of intersection. Instead, it annotates the diagram with the actual frequencies. (Note that this solution does not involve any edits to the data$factor
column.)
library(limma)
out <- aggregate(factor ~ cluster, data=data, FUN=table)
out <- cbind(out[1], data.frame(out[2][[1]]))
counts <- vennCounts(out[, -1] >= 2)
vennDiagram(counts, names = c("Factor A", "Factor B", "Factor C"),
cex = 1, counts.col = "red")