Background: I am in the process of annotating SNPs from a GWAS in an organism without much annotation. I am using the chained tBLASTn table from UCSC along with biomaRt to map each SNP to a probable gene(s).
I have a dataframe that looks like this:
SNP hu_mRNA gene
chr1.111642529 NM_002107 H3F3A
chr1.111642529 NM_005324 H3F3B
chr1.111801684 BC098118 <NA>
chr1.111925084 NM_020435 GJC2
chr1.11801605 AK027740 <NA>
chr1.11801605 NM_032849 C13orf33
chr1.151220354 NM_018913 PCDHGA10
chr1.151220354 NM_018918 PCDHGA5
What I would like to end up with is a single row for each SNP, with the genes and hu_mRNAs comma-delimited. Here is what I am after:
SNP hu_mRNA gene
chr1.111642529 NM_002107,NM_005324 H3F3A
chr1.111801684 BC098118,NM_020435 GJC2
chr1.11801605 AK027740,NM_032849 C13orf33
chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5
Now I know I can do this with a flick of the wrist in Perl, but I really want to do this all in R. Any suggestions?
First set up the test data. Note that the columns are made to be of "character" class rather than "factor" by using as.is=TRUE. Then apply an aggregate statement.
You can also use aggregate with paste for each column and merge the results at the end.
This can also be solved using reshape2's melt and dcast operations. With this approach, melt first transforms the data to "long" format, and the values are then dcast with the same operation, paste(..., collapse = ",").
Here's a dplyr solution, which IMHO is the most readable.
You could do this in one line using plyr, as it is a classic split-apply-combine problem. You split using SNP, apply paste with collapse, and assemble the pieces back into a data frame. If you want to do data reshaping in R at the flick of a wrist, learn plyr and reshape2 :).
Another flick-of-the-wrist solution uses data.table, which is really useful if you are dealing with massive amounts of data.
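The plyr and data.table approaches might look roughly like the sketch below. It is not necessarily the code the answers originally used: DF, Lines and collapse_vals are illustrative names, NA genes are dropped by assumption, and each part runs only if the corresponding package is installed.

```r
# Same test data as in the question, with "<NA>" read back as NA.
Lines <- "SNP hu_mRNA gene
chr1.111642529 NM_002107 H3F3A
chr1.111642529 NM_005324 H3F3B
chr1.111801684 BC098118 <NA>
chr1.111925084 NM_020435 GJC2
chr1.11801605 AK027740 <NA>
chr1.11801605 NM_032849 C13orf33
chr1.151220354 NM_018913 PCDHGA10
chr1.151220354 NM_018918 PCDHGA5"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE,
                 na.strings = "<NA>")

# Hypothetical helper: drop NAs (an assumption), then join with commas.
collapse_vals <- function(x) paste(na.omit(x), collapse = ",")

# plyr: split by SNP, apply the helper to each piece, combine the pieces.
if (requireNamespace("plyr", quietly = TRUE)) {
  res_plyr <- plyr::ddply(DF, "SNP", function(d) {
    data.frame(hu_mRNA = collapse_vals(d$hu_mRNA),
               gene    = collapse_vals(d$gene),
               stringsAsFactors = FALSE)
  })
}

# data.table: .SD holds the non-grouping columns of each SNP group.
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::as.data.table(DF)
  res_dt <- DT[, lapply(.SD, collapse_vals), by = SNP]
}
```

Both versions should return one row per SNP, with hu_mRNA and gene collapsed into comma-separated strings.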