I have a large data frame looking similar to this:
df <- data.frame(dive=factor(sample(c("dive1","dive2"),10,replace=TRUE)),speed=runif(10))
> df
dive speed
1 dive1 0.80668490
2 dive1 0.53349584
3 dive2 0.07571784
4 dive2 0.39518628
5 dive1 0.84557955
6 dive1 0.69121443
7 dive1 0.38124950
8 dive2 0.22536126
9 dive1 0.04704750
10 dive2 0.93561651
My goal is to average the values of one column when another column is equal to a certain value, and to repeat this for all values. i.e. in the example above I would like to return an average for the column speed for every unique value of the column dive. So when dive==dive1, the average for speed is this, and so on for each value of dive.
2015 update with dplyr:
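For instance, something along these lines (a minimal sketch using the df defined above; the output column name mean_speed is just an illustrative choice):

library(dplyr)

# one row per dive level, with the mean speed for that level
df %>%
  group_by(dive) %>%
  summarise(mean_speed = mean(speed))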
There are many ways to do this in R. Specifically, by, aggregate, split, plyr, cast, tapply, data.table, dplyr, and so forth.

Broadly speaking, these problems are of the form split-apply-combine. Hadley Wickham has written a beautiful article that will give you deeper insight into the whole category of problems, and it is well worth reading. His plyr package implements the strategy for general data structures, and dplyr is a newer implementation performance-tuned for data frames. They allow for solving problems of the same form but of even greater complexity than this one. They are well worth learning as a general tool for solving data manipulation problems.

Performance is an issue on very large datasets, and for that it is hard to beat solutions based on data.table. If you only deal with medium-sized datasets or smaller, however, taking the time to learn data.table is likely not worth the effort. dplyr can also be fast, so it is a good choice if you want to speed things up but don't quite need the scalability of data.table.

Many of the other solutions below do not require any additional packages. Some of them are even fairly fast on medium-large datasets. Their primary disadvantage is either one of metaphor or of flexibility. By metaphor I mean that it is a tool designed for something else being coerced to solve this particular type of problem in a 'clever' way. By flexibility I mean they lack the ability to solve as wide a range of similar problems or to easily produce tidy output.
Examples

base functions

tapply:
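A minimal call with the df above might look like this (the means in the comment correspond to the example data printed in the question; a fresh random df will give different numbers):

# mean speed for each level of dive; returns a named vector
tapply(df$speed, df$dive, mean)
#     dive1     dive2
# 0.5508786 0.4079705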
aggregate:
aggregate takes in data.frames, outputs data.frames, and uses a formula interface.
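For example, a sketch of the formula-interface call:

# data.frame in, data.frame out; the formula reads "speed, grouped by dive"
aggregate(speed ~ dive, data = df, FUN = mean)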
by:
In its most user-friendly form, it takes in vectors and applies a function to them. However, its output is not in a very manipulable form:
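A sketch of what that might look like:

# applies mean() to speed within each level of dive; the result is a 'by' object
res.by <- by(df$speed, df$dive, mean)
res.by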
To get around this, for simple uses of by, the as.data.frame method in the taRifx library works:
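If I recall the taRifx interface correctly (treat the exact method as an assumption), the conversion could look like:

library(taRifx)   # assumption: taRifx supplies an as.data.frame() method for 'by' objects
as.data.frame(res.by)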
split:
As the name suggests, it performs only the "split" part of the split-apply-combine strategy. To make the rest work, I'll write a small function that uses sapply for apply-combine. sapply automatically simplifies the result as much as possible. In our case, that means a vector rather than a data.frame, since we've got only 1 dimension of results.
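One way to write that helper (splitmean is just an illustrative name):

# split df by dive, apply mean() to speed within each piece,
# and let sapply() simplify the list of results into a named vector
splitmean <- function(df) {
  s <- split(df, df$dive)
  sapply(s, function(x) mean(x$speed))
}
splitmean(df)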
External packages

data.table:
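A sketch of the data.table idiom (mean_speed is an arbitrary column name):

library(data.table)

dt <- as.data.table(df)                      # or setDT(df) to convert in place
dt[, .(mean_speed = mean(speed)), by = dive]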
dplyr:
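The same operation as the piped version in the 2015 update above, written here as nested calls for variety:

library(dplyr)

# group, then summarise each group down to one row
summarise(group_by(df, dive), mean_speed = mean(speed))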
plyr (the precursor of dplyr):
The official page makes the case for plyr well; in other words, if you learn one tool for split-apply-combine manipulation it should be plyr.
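A sketch with ddply (again, mean_speed is just an illustrative name):

library(plyr)

# split df by dive and summarise each piece down to one row
ddply(df, .(dive), summarise, mean_speed = mean(speed))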
reshape2:
The reshape2 library is not designed with split-apply-combine as its primary focus. Instead, it uses a two-part melt/cast strategy to perform a wide variety of data reshaping tasks. However, since it allows an aggregation function it can be used for this problem. It would not be my first choice for split-apply-combine operations, but its reshaping capabilities are powerful and thus you should learn this package as well.
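A sketch of how the melt/cast pair could be pressed into service here:

library(reshape2)

# melt() uses dive as the id variable, leaving speed as the measured variable;
# dcast() then aggregates it by dive with mean()
dcast(melt(df), variable ~ dive, fun.aggregate = mean)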
Benchmarks

10 rows, 2 groups
As usual, data.table has a little more overhead so comes in about average for small datasets. These are microseconds, though, so the differences are trivial. Any of the approaches works fine here, and you should choose based on what you value (plyr is always worth learning for its flexibility; data.table is worth learning if you plan to analyze huge datasets; by, aggregate, and split are all base R functions and thus universally available).

10 million rows, 10 groups
But what if we have a big dataset? Let's try 10^7 rows split over ten groups.
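One way such a comparison might be set up (a sketch assuming the microbenchmark package; bigdf and bigdt are illustrative names, and timings will of course depend on your machine):

library(data.table)
library(dplyr)
library(microbenchmark)

# 10^7 rows split over ten groups
n     <- 1e7
bigdf <- data.frame(dive  = factor(sample(paste0("dive", 1:10), n, replace = TRUE)),
                    speed = runif(n))
bigdt <- as.data.table(bigdf)

microbenchmark(
  data.table = bigdt[, .(mean_speed = mean(speed)), by = dive],
  dplyr      = bigdf %>% group_by(dive) %>% summarise(mean_speed = mean(speed)),
  tapply     = tapply(bigdf$speed, bigdf$dive, mean),
  aggregate  = aggregate(speed ~ dive, data = bigdf, FUN = mean),
  times = 10
)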
Then data.table or dplyr operating on data.tables is clearly the way to go. Certain approaches (aggregate and dcast) are beginning to look very slow.

10 million rows, 1,000 groups
If you have more groups, the difference becomes more pronounced. With 1,000 groups and the same 10^7 rows:
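Reusing the names from the sketch above, the setup might change to something like:

# same 10^7 rows, but 1,000 distinct dive labels instead of 10
bigdf$dive <- factor(sample(paste0("dive", 1:1000), n, replace = TRUE))
bigdt      <- as.data.table(bigdf)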
So data.table continues scaling well, and dplyr operating on a data.table also works well, with dplyr on a data.frame close to an order of magnitude slower. The split/sapply strategy seems to scale poorly in the number of groups (meaning the split() is likely slow and the sapply is fast). by continues to be relatively efficient: at 5 seconds it's definitely noticeable to the user, but for a dataset this large still not unreasonable. Still, if you're routinely working with datasets of this size, data.table is clearly the way to go: 100% data.table for the best performance, or dplyr operating on a data.table as a viable alternative.