Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
  col1     col2 f
1    1 10000001 0
2    2 10000002 0
3    3 10000003 0
4    4 10000004 0
5    5 10000005 5
6    6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish in anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is the better way to do this?
Regarding a better way to do this: with 4M rows you should use data.table.
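Since you need to process multiple columns together, here is a minimal sketch of the data.table approach, reusing the column names from your example (the result names mean1 and mean2 are just illustrative):

library(data.table)
dt <- as.data.table(x)   # or setDT(x) to convert in place, by reference
# one pass over the data, grouped by f, computing several columns at once
z <- dt[, .(mean1 = mean(col1), mean2 = mean(col2)), by = f]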
by is slower than tapply because it is wrapping tapply and doing extra work on top. Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by.
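The exact timings will depend on your machine, but a benchmark along these lines (a sketch using microbenchmark; times is kept small because by is slow at this size) reproduces the pattern:

library(data.table)
library(microbenchmark)

x <- data.frame(col1 = 1:4000000, col2 = 10000001:14000000)
x$f <- x$col1 - x$col1 %% 5
dt <- as.data.table(x)

microbenchmark(
  tapply     = tapply(x$col1, x$f, mean),
  by         = by(x$col1, x$f, mean),
  data.table = dt[, mean(col1), by = f],
  times = 3L
)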
UPDATED to include @Roland's great recommendation: If x$f is a factor, the loss in efficiency between tapply and by is even greater! Notice, though, that they both improve relative to non-factor inputs, while data.table remains approximately the same or worse.
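If I am reading that update correctly, the factor case simply means converting the grouping column before re-running the same comparisons, e.g.:

xf <- x
xf$f <- factor(xf$f)   # grouping column as an actual factor
# then re-run the tapply / by / data.table calls above on xf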
As for the why, the short answer is in the documentation itself: ?by describes by as an object-oriented wrapper for tapply applied to data frames. Let's take a look at the source for by (or, more specifically, by.data.frame).
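It is reproduced approximately below; you can print the exact version on your machine with getS3method("by", "data.frame"), and the body may differ slightly between R versions:

> getS3method("by", "data.frame")
function (data, INDICES, FUN, ..., simplify = TRUE)
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    structure(eval(substitute(tapply(seq_len(nd), IND, FUNx,
        simplify = simplify)), data), call = match.call(), class = "by")
}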
We see immediately that there is still a call to tapply, plus a lot of extras (including calls to deparse(substitute(.)) and an eval(substitute(.)), both of which are relatively slow). Therefore it makes sense that your tapply will be relatively faster than a similar call to by.