In my research work, I normally deal with big 4D arrays (20-200 million elements). I'm trying to improve the computational speed of my calculations, looking for an optimal trade-off between speed and simplicity. I've already made some progress thanks to SO (see here and here).
Now, I'm trying to exploit the latest packages like data.table and plyr.
Let's start with something like:
D = c(100, 1000, 8) #x,y,t
d = array(rnorm(prod(D)), dim = D)
I'd like to get, for each x (first dimension) and y (second dimension), the values of t that are above the 90th percentile. Let's do that with base R:
system.time(
    q1 <- apply(d, c(1, 2), function(x) {
        return(x >= quantile(x, .9, names = F))
    })
)
On my MacBook it takes about ten seconds, and I get back an array like this:
> dim(q1)
[1] 8 100 1000
(apply strangely changes the order of the dimensions; I don't care about that for now.)
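In case the original x, y, t order matters later, a base aperm call puts it back (just a one-liner, shown here as a side note):

aperm(q1, c(2, 3, 1))  # 100 x 1000 x 8, i.e. the original x, y, t order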
Now I can melt (reshape2 package) my array and use it in a data.table:
> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
Then I do some data.table "magic":
system.time({
    q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
The computation now takes slightly less than 10 seconds. Now I want to try with plyr and ddply:
system.time({
    q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
Now it takes 60 seconds. If I move to dplyr, I can do the same calculation in about ten seconds again.
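For completeness, the dplyr version I'm timing looks roughly like this (a sketch on the same melted data frame d_m, using a group_by/mutate pipeline; the exact code may differ):

library(dplyr)
system.time({
    q4 <- d_m %>%
        group_by(x, y) %>%
        mutate(q = quantile(value, .9, names = F),
               ev = value > q)
})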
However, my question is the following: what would you do to perform the same calculation faster? If I consider a larger array (say 20 times bigger), data.table is faster than the apply approach, but still on the same order of magnitude (14 minutes vs. 10 minutes).
Any comments are really appreciated...
EDIT
I've implemented the quantile function in C++ using Rcpp, speeding up the computation by a factor of eight.
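Roughly, the idea is a compiled version of R's default (type 7) quantile on a sorted copy of each group; a simplified sketch (not necessarily my exact code, and quantile7 is just an illustrative name):

library(Rcpp)

cppFunction('
double quantile7(NumericVector x, double p) {
    // sort a copy and linearly interpolate, as in the default (type 7) quantile
    NumericVector y = clone(x);
    std::sort(y.begin(), y.end());
    double h = (y.size() - 1) * p;
    int lo = (int) h;
    if (lo + 1 == y.size()) return y[lo];
    return y[lo] + (h - lo) * (y[lo + 1] - y[lo]);
}
')

system.time(
    q_cpp <- apply(d, c(1, 2), function(x) x >= quantile7(x, .9))
)

The same compiled function can also be dropped into the data.table expression in place of quantile.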