Emulate split() with dplyr group_by: return a list

Posted 2019-01-08 21:21

Question:

I have a large dataset that chokes split() in R. I am able to use dplyr's group_by (which is the preferred way anyway), but I am unable to persist the resulting grouped_df as a list of data frames, the format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).

Consider a sample dataset:

# data.frame() keeps V2 and V3 numeric; cbind() would coerce every column to character
df = data.frame(V1 = c("a","a","b","b","c"), V2 = c(1,2,3,4,5), V3 = c(2,3,4,2,2))
listDf = split(df, df$V1)

returns

$a
   V1 V2 V3
 1  a  1  2
 2  a  2  3

$b
   V1 V2 V3
 3  b  3  4
 4  b  4  2

$c
   V1 V2 V3
 5  c  5  2

I would like to emulate this with group_by (something like group_by(df, V1)), but this returns a single grouped_df rather than a list. I know that do() should be able to help me, but I am unsure about its usage (also see link for a discussion).

Note that split() names each list element after the factor level that defines its group. This is a desired feature (ultimately, bonus kudos for a way to extract these names from the list of dfs).

Answer 1:

To stay close to dplyr, you can also use its predecessor plyr instead of split():

library(plyr)

dlply(df, "V1", identity)
#$a
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#$b
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#$c
#  V1 V2 V3
#1  c  5  2


Answer 2:

Comparing the base, plyr and dplyr solutions, the base one is still the fastest in this benchmark (by median, roughly 2x faster than dplyr and 7x faster than plyr):

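library(plyr)            # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
library(microbenchmark)  # provides microbenchmark()

# tibble() and as_tibble() are the current names for data_frame() and as.tbl()
df <- tibble(Group1 = rep(LETTERS, each = 1000),
             Group2 = rep(rep(1:10, each = 100), 26),
             Value  = rnorm(26 * 1000))

microbenchmark(
  Base  = df %>% split(list(.$Group2, .$Group1)),
  dplyr = df %>%
    group_by(Group1, Group2) %>%
    do(data = (.)) %>%
    select(data) %>%
    lapply(function(x) {(x)}) %>% .[[1]],
  plyr  = dlply(df, c("Group1", "Group2"), as_tibble),
  times = 50)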

Gives:

Unit: milliseconds
  expr      min        lq      mean    median        uq       max neval
  Base 12.82725  13.38087  16.21106  14.58810  17.14028  41.67266    50
 dplyr 25.59038  26.66425  29.40503  27.37226  28.85828  77.16062    50
  plyr 99.52911 102.76313 110.18234 106.82786 112.69298 140.97568    50


Answer 3:

You can get a list of data frames from group_by using do as long as you name the new column where the data frames will be stored and then pipe that column into lapply.

listDf = df %>% group_by(V1) %>% do(vals=data.frame(.)) %>% select(vals) %>% lapply(function(x) {(x)})
listDf[[1]]
#[[1]]
#  V1 V2 V3
#1  a  1  2
#2  a  2  3

#[[2]]
#  V1 V2 V3
#1  b  3  4
#2  b  4  2

#[[3]]
#  V1 V2 V3
#1  c  5  2


Answer 4:

Since dplyr 0.5.0.9000, the shortest solution that uses group_by() is probably to follow do with a pull:

df %>% group_by(V1) %>% do(data=(.)) %>% pull(data)

Note that, unlike split, this doesn't name the resulting list elements. If this is desired, then you would probably want something like

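# set_names() is not exported by dplyr; it comes from purrr (also rlang and magrittr).
# Base R's setNames(data, V1) can be used in its place here.
df %>% group_by(V1) %>% do(data = (.)) %>% with( set_names(data, V1) )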
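
With a newer dplyr, a named list can probably be built without do() at all. The following is only a minimal sketch, assuming dplyr 0.8.0 or later, where group_split() and group_keys() are available; neither function is used elsewhere in this thread, and the sample data frame is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question (assumption: built with data.frame()
# so that V2 and V3 stay numeric)
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

grouped <- group_by(df, V1)

# group_split() returns one tibble per group, without names
pieces <- group_split(grouped)

# group_keys() returns one row per group, in the same order as the pieces
keys <- group_keys(grouped)

# Convert each piece back to a plain data frame and attach the group labels
listDf <- setNames(lapply(pieces, as.data.frame), keys$V1)

names(listDf)
listDf[["b"]]
```

Unlike split(), the pieces come back with row names restarting at 1 in each group, so the result is not identical to split(df, df$V1) in every attribute.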

To editorialize a little, I agree with the folks saying that split() is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., split( potentiallylongname, potentiallylongname$V1 )), but the issue is easily sidestepped with the pipe:

df %>% split( .$V1 )
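
As for the bonus part of the question, the group labels can be read straight off the list that split() returns. A minimal base R sketch using the question's sample data (rebuilt here with data.frame() so the snippet is self-contained; groupSizes is just an illustrative name):

```r
# Sample data from the question
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

listDf <- split(df, df$V1)

# The list is named after the values of the splitting column
names(listDf)

# Individual groups can then be looked up by label
listDf[["b"]]

# Named vector holding the number of rows in each group
groupSizes <- sapply(listDf, nrow)
groupSizes
```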


Tags: r list split dplyr