In my research work, I normally deal with big 4D arrays (20-200 million elements). I'm trying to improve the computational speed of my calculations, looking for an optimal trade-off between speed and simplicity. I've already made some progress thanks to SO (see here and here).
Now, I'm trying to exploit the latest packages like data.table and plyr.
Let's start with something like:
D = c(100, 1000, 8) #x,y,t
d = array(rnorm(prod(D)), dim = D)
I'd like to get, for each x (first dimension) and y (second dimension), the values of t that are above the 90th percentile. Let's do that with base R:
system.time(
    q1 <- apply(d, c(1, 2), function(x) {
        return(x >= quantile(x, .9, names = F))
    })
)
On my MacBook it takes about ten seconds, and I get back an array like this:
> dim(q1)
[1] 8 100 1000
(apply strangely changes the order of the dimensions; I don't care about that for now.)
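In case the original x, y, t order matters later, a base aperm call puts it back (just a one-liner, shown here as a side note):

aperm(q1, c(2, 3, 1))  # 100 x 1000 x 8, i.e. the original x, y, t order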
Now I can melt (reshape2 package) my array and use it in a data.table:
> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
Then I do some data.table "magic":
system.time({
    q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
The computation now takes slightly less than 10 seconds. Now I want to try with plyr and ddply:
system.time({
    q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
Now it takes 60 seconds. If I move to dplyr, I can do the same calculation in about ten seconds again.
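For completeness, the dplyr version I'm timing looks roughly like this (a sketch on the same melted data frame d_m, using a group_by/mutate pipeline; the exact code may differ):

library(dplyr)
system.time({
    q4 <- d_m %>%
        group_by(x, y) %>%
        mutate(q = quantile(value, .9, names = F),
               ev = value > q)
})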
However, my question is the following: what would you do to perform the same calculation faster? If I consider a larger array (say 20 times bigger), data.table is faster than the apply approach, but still on the same order of magnitude (14 minutes vs. 10 minutes).
Any comments are really appreciated...
EDIT
I've implemented the quantile function in C++ using Rcpp, speeding up the computation by a factor of eight.
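Roughly, the idea is a compiled version of R's default (type 7) quantile on a sorted copy of each group; a simplified sketch (not necessarily my exact code, and quantile7 is just an illustrative name):

library(Rcpp)

cppFunction('
double quantile7(NumericVector x, double p) {
    // sort a copy and linearly interpolate, as in the default (type 7) quantile
    NumericVector y = clone(x);
    std::sort(y.begin(), y.end());
    double h = (y.size() - 1) * p;
    int lo = (int) h;
    if (lo + 1 == y.size()) return y[lo];
    return y[lo] + (h - lo) * (y[lo + 1] - y[lo]);
}
')

system.time(
    q_cpp <- apply(d, c(1, 2), function(x) x >= quantile7(x, .9))
)

The same compiled function can also be dropped into the data.table expression in place of quantile.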