Generating a co-occurrance matrix in R on a LARGE

2019-09-12 05:13发布

问题:

I'm trying to create a co-occurrence matrix in R on a very large dataset (26M lines) that looks basically like this:

ID            Observation

11000             ficus
11112             cherry
11112             ficus
12223             juniper
12223             olive
12223             juniper
12223             ficus
12334             olive
12334             cherry
12334             olive
...             ...

And on for a long time. I want to consolidate the observations by ID and generate a co-occurance matrix of observations observed by observer ID. I managed this on a subset of the data but some of the stuff I did "manually" that it wouldn't be practical to do for the entire set. (see code below) I am resisting the urge to use loops because everyone tells me if you use a loop in R you're doing it wrong, but I'm not sure how else to achieve this.

Here's what worked for the smaller set:

tfmat = data.frame(cmsclip$ID[1:100],
              cmsclip$tree[1:100] %in% c(cmsclip$tree[1]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[2]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[3]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[4]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[5]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[6]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[7]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[9]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[10]),
               cmsclip$tree[1:100] %in% c(cmsclip$tree[11]))
colnames(tfmat) <- c('ID', cmsclip$tree[1:7], cmsclip$tree[9:11])

I just picked the first 10 unique tree names which in the full block of data happened to be cmsclip$tree[1:7], cmsclip$tree[9:11], and only looked at those, on the first 100 observations, generating a TRUE/FALSE matrix for each tree on each line. Then I used split() to split consolidate the data by ID number and summed each column (now representing a specific tree) to see how many of that tree each observer recorded.

testsplit = split(tfmat, tfmat[1])
summed1 <-colSums(testsplit$`ficus`)
summed2 <-colSums(testsplit$`cherry`)
summed3 <-colSums(testsplit$`juniper`)
summed4 <-colSums(testsplit$`pine`)
summed5 <-colSums(testsplit$`olive`)
summed6 <-colSums(testsplit$`elm`)
summed7 <-colSums(testsplit$`rain`)
summed8 <-colSums(testsplit$`redwood`)
summed9 <-colSums(testsplit$`shimpaku`)
summed10 <-colSums(testsplit$`maple`)

The problem with this is I typed each name in by hand and I can't do that with the whole data frame. This is where I'm looking for something better. I think combined each line into a final matrix that just had the number of observations of each tree in a given column where one line represented one ID number.

finmat = data.frame(summed1[2:11],summed2[2:11],summed3[2:11],
                summed4[2:11],summed5[2:11],summed6[2:11],
                summed7[2:11],summed8[2:11],summed9[2:11],
                summed10[2:11],summed11[2:11])

Then I did this...

finmat <- t(finmat)
treenames <- c(cmsclip$tree[1:7], cmsclip$tree[9:11])
colnames(finmat) <- treenames

total_occurrences <- colSums(finmat)
data_matrix <- as.matrix(finmat)
co_occurrence <- t(data_matrix) %*% data_matrix
library(igraph)
graph <- graph.adjacency(co_occurrence,
                                                weighted=TRUE,
                                                mode="undirected",
                                                diag=FALSE)

plot(graph,
      vertex.label=names(data),
      vertex.size=total_occurrences*5,
      edge.width=E(graph)$weight*8)

It worked fine, it's just not scaleable to a much larger set of data (too much manual stuff). Any suggestions for a faster way would be appreciate. Thanks!

EDIT: Ideally I would like to create a matrix where the rows correspond to the ID numbers and the columns to each individual tree, and it would show how many of each tree was observed. e.g.:

ID      ficus      cherry      juniper      olive      ...
11000       1      0      0       0      
11112       1       1       0       0      
12223       1       0       2       1      
12334       0       1       0       2      
...

I should add that I actually ultimately want to do hierarchical clustering on the data using the co-occurrence of observations of the different trees as a way to calculate the "distance" between trees. So any suggestions on how to achieve that would be great. And I could potentially abandon generating the above matrix if there is an easy way to jump from what I have to a distance matrix.

回答1:

Try this:

tapply(cmsclips$ID, cmsclips$Observation, FUN = c)

here an example:

test = data.frame(id = c(11,12,13,14), obs=c("cat", "dog", "cat", "cat"))
#  id obs
#1 11 cat
#2 12 dog
#3 13 cat
#4 14 cat
tapply(test$id, test$obs, FUN = c)
#$cat
#[1] 11 13 14
#
#$dog
#[1] 12


回答2:

Ok, I just used table() and it pretty much did exactly what I was looking for. Yay for learning R. Now to work on that distance matrix and clustering...

(Thanks, @mts for your help!)