This question already has answers here:
Closed 2 years ago.
I'm trying to do something similar to what's answered here, which gets me 80% of the way. I have a data frame with one ID column and multiple information columns. I'd like to roll up all of the other columns so that there's only one row for each ID, and multiple entries are separated by, for instance, a semicolon. Here's an example of what I have and what I want.
HAVE:
ID info1 info2
1 id101 one first
2 id102 twoA second alias A
3 id102 twoB second alias B
4 id103 threeA third alias A
5 id103 threeB third alias B
6 id104 four fourth
7 id105 five fifth
WANT:
ID info1 info2
1 id101 one first
2 id102 twoA; twoB second alias A; second alias B
3 id103 threeA; threeB third alias A; third alias B
4 id104 four fourth
5 id105 five fifth
Here's the code used to generate those:
have <- data.frame(ID=paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
info1=c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"),
info2=c("first", "second alias A", "second alias B", "third alias A", "third alias B", "fourth", "fifth"),
stringsAsFactors=FALSE)
want <- data_frame(ID=paste0("id", c(101:105)),
info1=c("one", "twoA; twoB", "threeA; threeB", "four", "five"),
info2=c("first", "second alias A; second alias B", "third alias A; third alias B", "fourth", "fifth"),
stringsAsFactors=FALSE)
This question asked basically the same question, but only a single "info" column. I have multiple other columns and would like to do this for all of them.
Bonus points for doing this using dplyr.
Here's an option using summarise_each
(which makes it easy to apply the changes to all columns except the grouping variables) and toString
:
require(dplyr)
have %>%
group_by(ID) %>%
summarise_each(funs(toString))
#Source: local data frame [5 x 3]
#
# ID info1 info2
#1 id101 one first
#2 id102 twoA, twoB second alias A, second alias B
#3 id103 threeA, threeB third alias A, third alias B
#4 id104 four fourth
#5 id105 five fifth
Or, if you want it separated by semicolons, you can use:
have %>%
group_by(ID) %>%
summarise_each(funs(paste(., collapse = "; ")))
Good old aggregate
does this just fine
aggregate(have[,2:3], by=list(have$ID), paste, collapse=";")
Question is: does it scale?
Here's a data.table
solution.
library(data.table)
setDT(have)[, lapply(.SD, paste, collapse = "; "), by = ID]
# ID info1 info2
# 1: id101 one first
# 2: id102 twoA; twoB second alias A; second alias B
# 3: id103 threeA; threeB third alias A; third alias B
# 4: id104 four fourth
# 5: id105 five fifth
Here is SQL solution^1:
library(sqldf)
#Static solution
sqldf("
SELECT ID,
GROUP_CONCAT(info1,';') as info1,
GROUP_CONCAT(info2,';') as info2
FROM have
GROUP BY ID")
#Dynamic solution
concat_cols <- colnames(have)[2:ncol(have)]
group_concat <-
paste(paste0("GROUP_CONCAT(",concat_cols,",';') as ", concat_cols),
collapse = ",")
sqldf(
paste("
SELECT ID,",
group_concat,"
FROM have
GROUP BY ID"))
# Same output for both static and dynamic solutions
# ID info1 info2
# 1 id101 one first
# 2 id102 twoA;twoB second alias A;second alias B
# 3 id103 threeA;threeB third alias A;third alias B
# 4 id104 four fourth
# 5 id105 five fifth
^1 - probably data.table
solution would perform better with millions of rows, luckily we don't have that many genes yet :)
library(stringr)
library(dplyr)
have %>% tbl_df %>% group_by(ID) %>% summarise_each(funs(str_c(., collapse="; ")))
Edit 1: So tbl_df
may not needed and instead of the str_c of the stringr
package you could use paste
(in base
). And what the above does is to group by the ID column and then apply the str_c
(or paste
) function to each remaining column for each group.
Edit 2: Another solution using the data.table package:
library(data.table)
dtbl <- as.data.table(have)
dtbl[,lapply(.SD, function(x) paste(x,collapse=";")), by=ID]
The above may be faster, especially if you set the key:
setkey(dtbl, ID)
"Hybrid" solution: You can use the dplyr
syntax for data.tables! For example:
dtbl %>% tbl_dt %>%
group_by(ID) %>%
summarise_each(funs(paste(., collapse="; ")))