I have a large codebase and the aggregation step is the current speed bottleneck. I'd like to speed up the data-grouping step in my code. A simple non-trivial example (SNOTE) of my data looks like this:
library(data.table)
a = sample(1:10000000, 50000000, replace = TRUE)
b = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
d = sample(c("3m","2m2d2m","3m2d1i3s2d","5m","4m","9m","1m"), 50000000, replace = TRUE)
e = a
dt = data.table(a = a, b = b, d = d, e = e)
system.time(c.dt <- dt[, list(b = paste(b, collapse=""), d = paste(d, collapse=""), e = e[1]), by = a])
   user  system elapsed
 60.107   3.143  63.534
This is quite fast for such a large data example, but in my case I am still looking for a further speed-up. I have multiple cores available, so I am almost sure there must be a way to use that computational capability.

I am open to changing my data type to data.frame or idata.frame objects (in theory idata.frame objects are supposedly faster than data.frames).

I did some research and it seems the plyr package has some parallel capabilities that could be helpful, but I am still struggling with how to do it for the grouping I am trying to do. In another SO post they discuss some of these ideas. I am still unsure how much more I'd achieve with this parallelization, since it uses the foreach function. In my experience, foreach is not a good idea for millions of fast operations, because the communication effort between cores ends up slowing down the parallelization effort.
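For reference, here is a minimal sketch of the foreach-style per-group approach I have been considering (the packages, toy data sizes, and two-core setup are my own choices, scaled down from the example above):

```r
library(data.table)
library(foreach)
library(doParallel)

# Small toy data standing in for the large example above
set.seed(1)
a <- sample(1:100, 500, replace = TRUE)
b <- sample(c("3m", "2m2d2m", "5m"), 500, replace = TRUE)
dt <- data.table(a = a, b = b)

registerDoParallel(cores = 2)
keys <- unique(dt$a)

# One foreach iteration per group: each worker collapses one group's strings
res <- foreach(k = keys, .combine = rbind, .packages = "data.table") %dopar% {
  data.table(a = k, b = paste(dt[a == k, b], collapse = ""))
}
stopImplicitCluster()

# The per-iteration work is tiny, so communication between workers dominates,
# which is exactly the overhead problem described above.
```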
Can you parallelize aggregation with `data.table`? Yes.

Is it worth it? NO. This is a key point that the previous answer failed to highlight.

As Matt Dowle explains in data.table and parallel computing, copies ("chunks") need to be made before being distributed when running operations in parallel. This slows things down. In some cases, when you cannot use `data.table` (e.g. running many linear regressions), it is worth splitting tasks up between cores. But not for aggregation, at least not when `data.table` is involved.

In short (and until proven otherwise), aggregate using `data.table` and stop worrying about potential speed increases using `doMC`. `data.table` is already blazing fast compared to anything else available when it comes to aggregation, even if it isn't multicore!

Here are some benchmarks you can run for yourself comparing `data.table` internal aggregation using `by` with `foreach` and `mclapply`. The results are listed first.

If you have multiple cores available to you, why not leverage the fact that you can quickly filter & group rows in a data.table using its key:
Note that if the number of unique groups (i.e. `length(unique(a))`) is relatively small, it will be faster to drop the `.combine` argument, get the results back in a list, then call `rbindlist` on the results. In my testing on two cores & 8GB RAM, the threshold was at about 9,000 unique values. Here is what I used to benchmark:
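The benchmark script itself did not survive the copy; a hedged reconstruction of the `.combine` comparison described above (data sizes scaled down, two cores assumed, `system.time` used as the timing harness) could look like:

```r
library(data.table)
library(foreach)
library(doParallel)

set.seed(1)
n  <- 1e5
dt <- data.table(a = sample(1:500, n, replace = TRUE),
                 b = sample(c("3m", "2m2d2m", "5m"), n, replace = TRUE))
setkey(dt, a)
keys <- unique(dt$a)
registerDoParallel(cores = 2)

# Variant 1: let foreach rbind the pieces incrementally via .combine
t_combine <- system.time(
  r1 <- foreach(k = keys, .combine = rbind, .packages = "data.table") %dopar%
    dt[.(k), .(a = k, b = paste(b, collapse = ""))]
)

# Variant 2: collect a plain list, then do one rbindlist at the end
t_list <- system.time(
  r2 <- rbindlist(
    foreach(k = keys, .packages = "data.table") %dopar%
      dt[.(k), .(a = k, b = paste(b, collapse = ""))]
  )
)
stopImplicitCluster()

print(rbind(combine = t_combine, rbindlist = t_list)[, "elapsed"])
```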