R: finding duplicates in one column and collapsing the matching values of a second column

Posted 2020-02-11 09:00

I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several rows with the same character string). For each value in probes I want to find all the rows containing the same string, and then merge the values of the corresponding rows in the second column (named genes) into a single row. For example, if I have this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1
3   cg00061679  DAZ4
4   cg00061679  DAZ4

I want to change it to this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1 DAZ4 DAZ4

Obviously there is no problem doing this for a single probe, using which and then paste with collapse:

ind <- which(olap$probes == "cg00061679")
genename <- olap[ind, 2]
genecomb <- paste(genename, collapse = " ")

But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?

Thanks in advance
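
For reference, here is a minimal sketch of how the per-probe row indices could be pulled out across the whole data frame. The olap data frame is rebuilt here from the example table, so its construction is an assumption, and the names idx_by_probe and dup_rows are made up for illustration:

```r
# Example data rebuilt from the table in the question (assumed layout)
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Row indices grouped by probe: one list element per distinct probe
idx_by_probe <- split(seq_len(nrow(olap)), olap$probes)

# Rows whose probe occurs more than once anywhere in the column
dup_rows <- which(duplicated(olap$probes) |
                  duplicated(olap$probes, fromLast = TRUE))

# Apply the single-probe paste/collapse step to every group of indices
genecomb <- sapply(idx_by_probe,
                   function(i) paste(olap$genes[i], collapse = " "))
```

duplicated() on its own only flags the second and later occurrences, which is why it is combined with the fromLast = TRUE call to catch the first occurrence as well.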

2 answers
Root(大扎)
#2 · 2020-02-11 09:01

You can use tapply in base R:

data.frame(probes=unique(olap$probes), 
           genes=tapply(olap$genes, olap$probes, paste, collapse=" "))

or use plyr:

library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))

UPDATE

It's probably safer in the first version to do this:

tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)

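This guards against unique returning the probes in a different order from tapply: unique keeps the order of first appearance, whereas tapply orders its result by the sorted factor levels of the grouping vector. Personally I would always use ddply.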
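
To illustrate the ordering concern, here is a small sketch. The data below is a reordered variant of the question's example, chosen as an assumption so that the first probe to appear is not the first in sorted order:

```r
# Hypothetical reordering of the example data: the first probe seen
# is not the alphabetically first one
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# unique() keeps the order of first appearance
first_seen <- unique(olap$probes)

# tapply() groups by factor levels, which are sorted
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# Taking the labels from names(tmp) keeps probes and genes aligned
res <- data.frame(probes = names(tmp), genes = as.vector(tmp),
                  row.names = NULL, stringsAsFactors = FALSE)
```

With this input, first_seen and names(tmp) list the probes in opposite orders, so pairing unique(olap$probes) with the tapply result would attach the collapsed genes to the wrong probes.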

啃猪蹄的小仙女
#3 · 2020-02-11 09:15

Base R aggregate() should work fine for this:

aggregate(genes ~ probes, data = olap, as.vector)
#       probes            genes
# 1 cg00050873            TSPY4
# 2 cg00061679 DAZ1, DAZ4, DAZ4

I prefer as.vector in case I need to do any further work on the data. This stores the genes column as a list, but you can also try aggregate(genes ~ probes, data = olap, paste, collapse = " ") if you prefer it to be a character string.
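
As a rough sketch of the difference between the two calls, with olap rebuilt from the question's example table (an assumption) and the result names as_list and as_text invented for illustration:

```r
# Example data rebuilt from the question (assumed layout)
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# as.vector keeps every gene as a separate element, so genes is a list column
as_list <- aggregate(genes ~ probes, data = olap, FUN = as.vector)

# paste with collapse yields one string per probe, so genes is a character column
as_text <- aggregate(genes ~ probes, data = olap, FUN = paste, collapse = " ")

is.list(as_list$genes)       # list column
lengths(as_list$genes)       # number of genes kept per probe
is.character(as_text$genes)  # plain character column
```

The list form keeps individual genes accessible, for example through as_list$genes[[2]], while the pasted form is easier to print or write to a delimited file.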
