I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several cases with the same character string). For each case in probes I want to find all the cases containing the same string, and then merge the values of all the corresponding cases in the second column (named genes) into a single case.
For example, if I have this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1
3 cg00061679 DAZ4
4 cg00061679 DAZ4
I want to change it to this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1 DAZ4 DAZ4
Obviously there is no problem doing this for a single probe using which, and then paste with collapse:
ind <- which(olap$probes == "cg00061679")    # rows holding this probe
genename <- olap[ind, 2]                     # matching values in the genes column
genecomb <- paste(genename, collapse = " ")  # join them into a single string
But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance
You can use tapply in base R, or use plyr:
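A minimal sketch of both approaches, assuming the data frame is called olap as in the question. The exact calls are an illustration and not necessarily the code originally posted; the plyr part is guarded so the snippet still runs when plyr is not installed.

```r
# Example data from the question
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Base R: tapply() splits genes by probe and pastes each group together
merged <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# Pairing the result with unique() assumes that unique() and tapply()
# return the probes in the same order (true for this example data)
res_base <- data.frame(probes = unique(olap$probes),
                       genes  = as.vector(merged),
                       stringsAsFactors = FALSE)

# plyr: ddply() does the grouping and returns a data frame directly
if (requireNamespace("plyr", quietly = TRUE)) {
  res_plyr <- plyr::ddply(olap, "probes", plyr::summarise,
                          genes = paste(genes, collapse = " "))
}
```

Both results have one row per probe, with the gene names separated by spaces.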
UPDATE
It's probably safer in the first version to do this:
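One way to do that, sketched under the assumption that the data frame is still olap: take the probe labels from the names of the tapply result instead of from unique, so labels and merged values cannot get out of step. The shuffled data frame below is a made-up variant of the question's data, used only to show the ordering difference.

```r
# Hypothetical variant of the question's data with the probes out of order
shuffled <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4"),
  stringsAsFactors = FALSE
)

merged <- tapply(shuffled$genes, shuffled$probes, paste, collapse = " ")

# unique() keeps order of first appearance; tapply() sorts the groups
order_unique <- unique(shuffled$probes)
order_tapply <- names(merged)

# Safer: use the names carried by the tapply() result as the probe column
res_safe <- data.frame(probes = names(merged),
                       genes  = as.vector(merged),
                       stringsAsFactors = FALSE)
```

Here order_unique starts with cg00061679 while order_tapply starts with cg00050873, which is exactly the mismatch the names-based version avoids.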
Just in case unique gives the probes in a different order to tapply. Personally I would always use ddply.
Base R aggregate() should work fine for this.
I prefer as.vector in case I need to do any further work on the data (this stores the genes column as a list), but you can also try aggregate(genes ~ probes, data=test, paste, collapse=" ") if you prefer it to be a character string.
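A minimal sketch of the two aggregate() variants, assuming test holds the same data as the question's olap. The as.vector call is inferred from the description above, not quoted from it.

```r
# Assumed to match the question's example data
test <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# as.vector keeps each probe's genes as a separate vector; because the
# groups differ in length, the genes column comes back as a list
res_list <- aggregate(genes ~ probes, data = test, as.vector)

# paste with collapse gives one space-separated string per probe instead
res_str <- aggregate(genes ~ probes, data = test, paste, collapse = " ")
```

The list form is convenient if the individual gene names are needed later, for example res_list$genes[[2]] returns the three entries for cg00061679; the string form matches the layout asked for in the question.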