Collapse all columns by an ID column [duplicate]

2020-01-29 06:23发布

站内文章 / 前端开发

14 0

狗以群分

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

This question already has answers here:

Collapse / concatenate / aggregate a column to a single comma separated string within each group (3 answers)

Closed 2 years ago.

I'm trying to do something similar to what's answered here, which gets me 80% of the way. I have a data frame with one ID column and multiple information columns. I'd like to roll up all of the other columns so that there's only one row for each ID, and multiple entries are separated by, for instance, a semicolon. Here's an example of what I have and what I want.

HAVE:

     ID  info1          info2
1 id101    one          first
2 id102   twoA second alias A
3 id102   twoB second alias B
4 id103 threeA  third alias A
5 id103 threeB  third alias B
6 id104   four         fourth
7 id105   five          fifth

WANT:

     ID          info1                          info2
1 id101            one                          first
2 id102     twoA; twoB second alias A; second alias B
3 id103 threeA; threeB   third alias A; third alias B
4 id104           four                         fourth
5 id105           five                          fifth

Here's the code used to generate those:

have <- data.frame(ID=paste0("id", c(101, 102, 102, 103, 103, 104, 105)),
                   info1=c("one", "twoA", "twoB", "threeA", "threeB", "four", "five"), 
                   info2=c("first", "second alias A", "second alias B", "third alias A", "third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)
want <- data_frame(ID=paste0("id", c(101:105)),
                   info1=c("one", "twoA; twoB", "threeA; threeB", "four", "five"), 
                   info2=c("first", "second alias A; second alias B", "third alias A; third alias B", "fourth", "fifth"),
                   stringsAsFactors=FALSE)

This question asked basically the same question, but only a single "info" column. I have multiple other columns and would like to do this for all of them.

Bonus points for doing this using dplyr.

回答1:

Here's an option using summarise_each (which makes it easy to apply the changes to all columns except the grouping variables) and toString:

require(dplyr)

have %>%
  group_by(ID) %>%
  summarise_each(funs(toString))

#Source: local data frame [5 x 3]
#
#     ID          info1                          info2
#1 id101            one                          first
#2 id102     twoA, twoB second alias A, second alias B
#3 id103 threeA, threeB   third alias A, third alias B
#4 id104           four                         fourth
#5 id105           five                          fifth

Or, if you want it separated by semicolons, you can use:

have %>%
  group_by(ID) %>%
  summarise_each(funs(paste(., collapse = "; ")))

回答2:

Good old aggregate does this just fine

aggregate(have[,2:3], by=list(have$ID), paste, collapse=";")

Question is: does it scale?

回答3:

Here's a data.table solution.

library(data.table)
setDT(have)[, lapply(.SD, paste, collapse = "; "), by = ID]
#       ID          info1                          info2
# 1: id101            one                          first
# 2: id102     twoA; twoB second alias A; second alias B
# 3: id103 threeA; threeB   third alias A; third alias B
# 4: id104           four                         fourth
# 5: id105           five                          fifth

回答4:

Here is SQL solution^1:

library(sqldf)
#Static solution
sqldf("
SELECT ID,
       GROUP_CONCAT(info1,';') as info1,
       GROUP_CONCAT(info2,';') as info2
FROM have
GROUP BY ID")

#Dynamic solution
concat_cols <- colnames(have)[2:ncol(have)]
group_concat <-
  paste(paste0("GROUP_CONCAT(",concat_cols,",';') as ", concat_cols),
        collapse = ",")
sqldf(
  paste("
      SELECT ID,",
      group_concat,"
      FROM have
      GROUP BY ID"))

# Same output for both static and dynamic solutions
#      ID         info1                         info2
# 1 id101           one                         first
# 2 id102     twoA;twoB second alias A;second alias B
# 3 id103 threeA;threeB   third alias A;third alias B
# 4 id104          four                        fourth
# 5 id105          five                         fifth

^1 - probably data.table solution would perform better with millions of rows, luckily we don't have that many genes yet :)

回答5:

library(stringr)
library(dplyr)
have %>% tbl_df %>% group_by(ID) %>% summarise_each(funs(str_c(., collapse="; ")))

Edit 1: So tbl_df may not needed and instead of the str_c of the stringr package you could use paste (in base). And what the above does is to group by the ID column and then apply the str_c (or paste) function to each remaining column for each group.

Edit 2: Another solution using the data.table package:

library(data.table)
dtbl <- as.data.table(have)
dtbl[,lapply(.SD, function(x) paste(x,collapse=";")), by=ID]

The above may be faster, especially if you set the key:

setkey(dtbl, ID)

"Hybrid" solution: You can use the dplyr syntax for data.tables! For example:

dtbl %>% tbl_dt %>%
     group_by(ID) %>% 
     summarise_each(funs(paste(., collapse="; ")))

标签： r dplyr

狗以群分

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~

Collapse all columns by an ID column [duplicate]

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮