Emulate split() with dplyr group_by: return a list

I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway) but I am unable to persist the resulting grouped_df as a list of data frames, a format required by my consecutive processing steps (I need to coerce to SpatialDataFrames and similar).

consider a sample dataset:

df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df,df$V1)

returns

I would like to emulate this with group_by (something like group_by(df,V1)) but this returns one, grouped_df. I know that do should be able to help me, but I am unsure about usage (also see link for a discussion.)

Note that split names each list by the name of the factor that has been used to establish this group - this is a desired function (ultimately, bonus kudos for a way to extract these names from the list of dfs).

标签： r list split dplyr

4条回答

在下西门庆

2楼-- · 2019-01-08 21:15

You can get a list of data frames from group_by using do as long as you name the new column where the data frames will be stored and then pipe that column into lapply.

listDf = df %>% group_by(V1) %>% do(vals=data.frame(.)) %>% select(vals) %>% lapply(function(x) {(x)})
listDf[[1]]
#[[1]]
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#[[2]]
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#[[3]]
#  V1 V2 V3
#1  c  5  2

0人赞添加讨论(0) 举报

欢心

3楼-- · 2019-01-08 21:16

To 'stick' to dplyr, you can also use plyr instead of split:

library(plyr)

dlply(df, "V1", identity)
#$a
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#$b
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#$c
#  V1 V2 V3
#1  c  5  2

0人赞添加讨论(0) 举报

戒情不戒烟

4楼-- · 2019-01-08 21:24

Comparing the base, plyr and dplyr solutions, it still seems the base one is much faster!

library(plyr)
library(dplyr)   

df <- data_frame(Group1=rep(LETTERS, each=1000),
             Group2=rep(rep(1:10, each=100),26), 
             Value=rnorm(26*1000))

microbenchmark(Base=df %>%
             split(list(.$Group2, .$Group1)),
           dplyr=df %>% 
             group_by(Group1, Group2) %>% 
             do(data = (.)) %>% 
             select(data) %>% 
             lapply(function(x) {(x)}) %>% .[[1]],
           plyr=dlply(df, c("Group1", "Group2"), as.tbl),
           times=50)

Gives:

Unit: milliseconds
  expr      min        lq      mean    median        uq       max neval
  Base 12.82725  13.38087  16.21106  14.58810  17.14028  41.67266    50
  dplyr 25.59038 26.66425  29.40503  27.37226  28.85828  77.16062   50
  plyr 99.52911  102.76313 110.18234 106.82786 112.69298 140.97568    50

0人赞添加讨论(0) 举报

Explosion°爆炸

5楼-- · 2019-01-08 21:35

Since dplyr 0.5.0.9000, the shortest solution that uses group_by() is probably to follow do with a pull:

df %>% group_by(V1) %>% do(data=(.)) %>% pull(data)

Note that, unlike split, this doesn't name the resulting list elements. If this is desired, then you would probably want something like

df %>% group_by(V1) %>% do(data = (.)) %>% with( set_names(data, V1) )

To editorialize a little, I agree with the folks saying that split() is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., split( potentiallylongname, potentiallylongname$V1 )), but the issue is easily sidestepped with the pipe:

df %>% split( .$V1 )

0人赞添加讨论(0) 举报

Emulate split() with dplyr group_by: return a list

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间