I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway), but I am unable to persist the resulting grouped_df as a list of data frames, a format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).
Consider a sample dataset:
df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df,df$V1)
returns
$a
V1 V2 V3
1 a 1 2
2 a 2 3
$b
V1 V2 V3
3 b 3 4
4 b 4 2
$c
V1 V2 V3
5 c 5 2
I would like to emulate this with group_by (something like group_by(df,V1)), but this returns a single grouped_df. I know that do should be able to help me, but I am unsure about usage (also see link for a discussion).
Note that split names each list element after the level of the factor that was used to establish the group. This is a desired feature (ultimately, bonus kudos for a way to extract these names from the list of dfs).
To 'stick' to dplyr, you can also use plyr instead of split:
library(plyr)
dlply(df, "V1", identity)
#$a
# V1 V2 V3
#1 a 1 2
#2 a 2 3
#$b
# V1 V2 V3
#1 b 3 4
#2 b 4 2
#$c
# V1 V2 V3
#1 c 5 2
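Comparing the base, plyr and dplyr solutions, the base one still seems to be the fastest: by median time in the benchmark below, it is roughly twice as fast as dplyr and about seven times as fast as plyr.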
library(plyr)   # load plyr before dplyr to avoid masking dplyr functions
library(dplyr)
library(microbenchmark)
df <- data_frame(Group1=rep(LETTERS, each=1000),
                 Group2=rep(rep(1:10, each=100),26),
                 Value=rnorm(26*1000))
microbenchmark(Base=df %>%
                 split(list(.$Group2, .$Group1)),
               dplyr=df %>%
                 group_by(Group1, Group2) %>%
                 do(data = (.)) %>%
                 select(data) %>%
                 lapply(function(x) {(x)}) %>% .[[1]],
               plyr=dlply(df, c("Group1", "Group2"), as.tbl),
               times=50)
Gives:
Unit: milliseconds
  expr      min        lq      mean    median        uq       max neval
  Base 12.82725  13.38087  16.21106  14.58810  17.14028  41.67266    50
 dplyr 25.59038  26.66425  29.40503  27.37226  28.85828  77.16062    50
  plyr 99.52911 102.76313 110.18234 106.82786 112.69298 140.97568    50
You can get a list of data frames from group_by using do, as long as you name the new column where the data frames will be stored and then pipe that column into lapply.
listDf = df %>% group_by(V1) %>% do(vals=data.frame(.)) %>% select(vals) %>% lapply(function(x) {(x)})
listDf[[1]]
#[[1]]
# V1 V2 V3
#1 a 1 2
#2 a 2 3
#[[2]]
# V1 V2 V3
#1 b 3 4
#2 b 4 2
#[[3]]
# V1 V2 V3
#1 c 5 2
Since dplyr 0.5.0.9000, the shortest solution that uses group_by() is probably to follow do with a pull:
df %>% group_by(V1) %>% do(data=(.)) %>% pull(data)
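Note that, unlike split, this doesn't name the resulting list elements. If this is desired, then you would probably want something like the following (set_names() is not exported by dplyr; it comes from purrr or magrittr, so one of them has to be loaded):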
df %>% group_by(V1) %>% do(data = (.)) %>% with( set_names(data, V1) )
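For readers on a newer dplyr, the same named list can probably be built without do(). The following is only a sketch, assuming dplyr 0.8.0 or later, where group_split() and group_keys() are available; the explicit column types in the toy data and the variable names grouped, pieces and keys are illustrative choices, not part of the original question.

```r
library(dplyr)

# Toy data mirroring the question's example (assumption: typed columns).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2),
                 stringsAsFactors = FALSE)

grouped <- group_by(df, V1)

# One tibble per group, in the same order as the grouping keys.
pieces <- group_split(grouped)

# One row per group, holding the value(s) of the grouping column(s).
keys <- group_keys(grouped)

# Attach the key values as names, mimicking what split() does.
listDf <- setNames(as.list(pieces), keys$V1)
```

With a single grouping column, keys$V1 is enough for the names; with several grouping columns the key columns would have to be pasted together first.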
To editorialize a little, I agree with the folks saying that split() is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., split( potentiallylongname, potentiallylongname$V1 )), but the issue is easily sidestepped with the pipe:
df %>% split( .$V1 )
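As for extracting the group names from the list of data frames: since split() names each element after its group, names() should be all that is needed. A minimal base R sketch, assuming a toy data frame shaped like the one in the question (the names groupNames and rowsPerGroup are made up for illustration):

```r
# Toy data mirroring the question's example (assumption: typed columns).
df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = 1:5,
                 V3 = c(2, 3, 4, 2, 2),
                 stringsAsFactors = FALSE)

# split() names each list element after the value of its group.
listDf <- split(df, df$V1)

# The group names are simply the names of the list.
groupNames <- names(listDf)

# The names carry through lapply/vapply, e.g. when counting rows per group.
rowsPerGroup <- vapply(listDf, nrow, integer(1))
```

Individual groups can then be pulled out by name, for example listDf[["b"]] or listDf$b.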