I have a large dataset that chokes `split()` in R. I am able to use `dplyr`'s `group_by` (which is the preferred way anyway), but I am unable to persist the resulting `grouped_df` as a list of data frames, a format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).

Consider a sample dataset:

    df = as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))
    listDf = split(df, df$V1)

returns

    $a
      V1 V2 V3
    1  a  1  2
    2  a  2  3

    $b
      V1 V2 V3
    3  b  3  4
    4  b  4  2

    $c
      V1 V2 V3
    5  c  5  2

I would like to emulate this with `group_by` (something like `group_by(df, V1)`), but this returns a single `grouped_df`. I know that `do` should be able to help me, but I am unsure about its usage (also see link for a discussion).

Note that `split` names each list element after the factor level that was used to establish the group. This is a desired feature (ultimately, bonus kudos for a way to extract these names from the list of data frames).

You can get a list of data frames from `group_by` using `do`, as long as you name the new column where the data frames will be stored and then pipe that column into `lapply`.
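
A minimal sketch of that idea, using the sample data from the question. The column name `vals` and the intermediate object `grouped` are arbitrary choices for this illustration, not something the approach requires. Note also that `do()` is marked as superseded in recent dplyr releases, although it is still available.

```r
library(dplyr)

# Sample data from the question (cbind() turns every column into character)
df <- as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))

# A named argument to do() stores one data frame per group in a list-column;
# `vals` is a hypothetical column name chosen for this sketch
grouped <- df %>%
  group_by(V1) %>%
  do(vals = data.frame(.))

# Pipe the list-column into lapply() to obtain a plain list of data frames
listDf <- grouped$vals %>% lapply(as.data.frame)
```

The result is an unnamed list with one data frame per level of `V1`.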

To 'stick' to dplyr, you can also use `plyr` instead of `split`:
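
One way this could look is with `dlply()`, which splits a data frame by a variable and returns a list. This is a sketch under the assumption that `dlply()` with `identity` is what is meant here; the function is called with the `plyr::` prefix so that attaching plyr does not mask dplyr functions.

```r
# Sample data from the question
df <- as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))

# Split df by V1 and return each piece unchanged (identity) as a list element
listDf <- plyr::dlply(df, "V1", identity)

# The list elements are named after the values of V1
names(listDf)
```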

Comparing the base, `plyr` and `dplyr` solutions, it still seems the base one is much faster!

Gives:

Since `dplyr 0.5.0.9000`, the shortest solution that uses `group_by()` is probably to follow `do` with a `pull`:
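
A hedged sketch of such a chain on the question's sample data. The list-column name `data` and the object names are assumptions made for this example. The last two statements show one possible way to attach group names with base R's `setNames()`, which may differ from what was originally intended here.

```r
library(dplyr)

# Sample data from the question
df <- as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))

# do() stores each group's rows in the list-column `data` (hypothetical name);
# pull() then extracts that column as an unnamed list
listDf <- df %>%
  group_by(V1) %>%
  do(data = as.data.frame(.)) %>%
  pull(data)

# One possible way to get names: keep the do() result, which also carries
# the grouping column V1, and use it to name the list-column
grouped <- df %>%
  group_by(V1) %>%
  do(data = as.data.frame(.))
namedListDf <- setNames(grouped$data, grouped$V1)
```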

Note that, unlike `split`, the result of `pull` doesn't name the resulting list elements. If this is desired, then you would probably want something like `setNames()` applied to the list-column and the grouping column.

To editorialize a little, I agree with the folks saying that `split()` is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., `split(potentiallylongname, potentiallylongname$V1)`), but the issue is easily sidestepped with the pipe:
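
For instance, a short sketch with the magrittr pipe that dplyr re-exports, again on the question's sample data. Inside the call, `.` stands for the piped data frame, so its name is typed only once.

```r
library(dplyr)  # provides the %>% pipe

# Sample data from the question
df <- as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))

# Equivalent to split(df, df$V1), without repeating the data frame's name
listDf <- df %>% split(.$V1)

# split() names the elements after the grouping values, so they can be
# read back directly from the list
names(listDf)  # "a" "b" "c"
```

This also covers the request for extracting the group names from the list of data frames: `names()` returns them.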