I'm trying to create a co-occurrence matrix in R on a very large dataset (26M lines) that looks basically like this:
ID Observation
11000 ficus
11112 cherry
11112 ficus
12223 juniper
12223 olive
12223 juniper
12223 ficus
12334 olive
12334 cherry
12334 olive
... ...
And on for a long time. I want to consolidate the observations by ID and generate a co-occurance matrix of observations observed by observer ID. I managed this on a subset of the data but some of the stuff I did "manually" that it wouldn't be practical to do for the entire set. (see code below) I am resisting the urge to use loops because everyone tells me if you use a loop in R you're doing it wrong, but I'm not sure how else to achieve this.
Here's what worked for the smaller set:
tfmat = data.frame(cmsclip$ID[1:100],
cmsclip$tree[1:100] %in% c(cmsclip$tree[1]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[2]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[3]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[4]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[5]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[6]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[7]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[9]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[10]),
cmsclip$tree[1:100] %in% c(cmsclip$tree[11]))
colnames(tfmat) <- c('ID', cmsclip$tree[1:7], cmsclip$tree[9:11])
I just picked the first 10 unique tree names which in the full block of data happened to be cmsclip$tree[1:7], cmsclip$tree[9:11], and only looked at those, on the first 100 observations, generating a TRUE/FALSE matrix for each tree on each line. Then I used split() to split consolidate the data by ID number and summed each column (now representing a specific tree) to see how many of that tree each observer recorded.
testsplit = split(tfmat, tfmat[1])
summed1 <-colSums(testsplit$`ficus`)
summed2 <-colSums(testsplit$`cherry`)
summed3 <-colSums(testsplit$`juniper`)
summed4 <-colSums(testsplit$`pine`)
summed5 <-colSums(testsplit$`olive`)
summed6 <-colSums(testsplit$`elm`)
summed7 <-colSums(testsplit$`rain`)
summed8 <-colSums(testsplit$`redwood`)
summed9 <-colSums(testsplit$`shimpaku`)
summed10 <-colSums(testsplit$`maple`)
The problem with this is I typed each name in by hand and I can't do that with the whole data frame. This is where I'm looking for something better. I think combined each line into a final matrix that just had the number of observations of each tree in a given column where one line represented one ID number.
finmat = data.frame(summed1[2:11],summed2[2:11],summed3[2:11],
summed4[2:11],summed5[2:11],summed6[2:11],
summed7[2:11],summed8[2:11],summed9[2:11],
summed10[2:11],summed11[2:11])
Then I did this...
finmat <- t(finmat)
treenames <- c(cmsclip$tree[1:7], cmsclip$tree[9:11])
colnames(finmat) <- treenames
total_occurrences <- colSums(finmat)
data_matrix <- as.matrix(finmat)
co_occurrence <- t(data_matrix) %*% data_matrix
library(igraph)
graph <- graph.adjacency(co_occurrence,
weighted=TRUE,
mode="undirected",
diag=FALSE)
plot(graph,
vertex.label=names(data),
vertex.size=total_occurrences*5,
edge.width=E(graph)$weight*8)
It worked fine, it's just not scaleable to a much larger set of data (too much manual stuff). Any suggestions for a faster way would be appreciate. Thanks!
EDIT: Ideally I would like to create a matrix where the rows correspond to the ID numbers and the columns to each individual tree, and it would show how many of each tree was observed. e.g.:
ID ficus cherry juniper olive ...
11000 1 0 0 0
11112 1 1 0 0
12223 1 0 2 1
12334 0 1 0 2
...
I should add that I actually ultimately want to do hierarchical clustering on the data using the co-occurrence of observations of the different trees as a way to calculate the "distance" between trees. So any suggestions on how to achieve that would be great. And I could potentially abandon generating the above matrix if there is an easy way to jump from what I have to a distance matrix.