I have a data table with around one to two million rows and about 200 columns.
Each row has a coordinate associated with it.
Tiny portion of the data:
[1,] -2.80331471 -0.8874522 -2.34401863 -3.811584 -2.1292443
[2,] 0.03177716 0.2588624 0.82877467 1.955099 0.6321881
[3,] -1.32954665 -0.5433407 -2.19211837 -2.342554 -2.2142461
[4,] -0.60771429 -0.9758734 0.01558774 1.651459 -0.8137684
Coordinates for the first 4 rows:
9928202 9928251 9928288 9928319
What I would like is a function that, given the data and a window size, returns a data table of the same size with a sliding-window mean applied to each column. In other words, for each row i it should find the rows with coordinates between coords[i]-windsize and coords[i]+windsize and replace the initial value with the mean of the values inside that interval (separately for each column).
Speed is the main issue here.
Here is my first take at such a function.
doSlidingWindow <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize/2)
    ### whole range indices
    RANGE <- integer(max(coords) + windsize)
    RANGE[coords] <- seq_along(coords)
    ### get indices of rows falling in each window
    WINDOWINDS <- sapply(coords, function(crds) {
        unique(RANGE[(crds - windHalfSize):(crds + windHalfSize)])
    })
    ### do windowing
    wind_ints <- intensities
    wind_ints[] <- 0
    for (i in seq_along(coords)) {
        wind_ints[i, ] <- apply(as.matrix(intensities[WINDOWINDS[[i]], ]), 2, mean)
    }
    return(wind_ints)
}
The code before the last for loop is quite fast and gets me a list of the indices I need for each entry. However, everything falls apart after that, since I have to grind through the for loop a million times, take subsets of my data table, and also make sure that I have more than one row so that apply can work across all the columns at once.
My second approach is to just stick the actual values into the RANGE vector, fill the gaps with zeroes and run rollmean from the zoo package, repeated for each column. But this is redundant, since rollmean will go through all the gaps and I will only be using the values at the original coordinates in the end.
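For reference, a minimal sketch of that second approach (the name doSlidingWindowZoo and the exact gap-filling are my own illustration, assuming sorted integer coords and the zoo package). It rolling-sums both the values and an indicator of the occupied positions, so the mean covers only the real rows; the dense intermediate matrix is exactly the redundancy described above.

library(zoo)

doSlidingWindowZoo <- function(intensities, coords, windsize) {
    k   <- 2 * ceiling(windsize / 2) + 1          # odd, centered window length
    off <- min(coords) - 1L                       # shift so the dense grid starts at row 1
    full <- matrix(0, nrow = max(coords) - off, ncol = ncol(intensities))
    full[coords - off, ] <- intensities
    present <- numeric(nrow(full))
    present[coords - off] <- 1
    sums <- apply(full, 2, rollsum, k = k, fill = 0)   # windowed sums over the dense grid
    cnts <- rollsum(present, k = k, fill = 0)          # number of real rows in each window
    sums[coords - off, , drop = FALSE] / cnts[coords - off]
}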
Any help to make it faster without going to C would be greatly appreciated.
Data generation:
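The original data-generation code was not preserved here; this is a hedged stand-in that matches the dimensions discussed below (N rows, M columns, sorted integer coordinates).

set.seed(42)
N <- 1e6                               # rows (reduce for quick tests)
M <- 100                               # columns; the question itself has about 200
intensities <- matrix(rnorm(N * M), nrow = N, ncol = M)
coords <- sort(sample(1e7:2e7, N))     # sorted, genomic-scale coordinates
windsize <- 100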
Original function with minor modifications I used for benchmarks:
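The exact "minor modifications" were not preserved here; a plausible cleaned-up version uses colMeans() instead of apply() and drop = FALSE to keep single-row windows as matrices:

doSlidingWindow2 <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize / 2)
    RANGE <- integer(max(coords) + windsize)
    RANGE[coords] <- seq_along(coords)
    WINDOWINDS <- sapply(coords, function(crds)
        unique(RANGE[(crds - windHalfSize):(crds + windHalfSize)]))
    wind_ints <- intensities
    wind_ints[] <- 0
    for (i in seq_along(coords))
        wind_ints[i, ] <- colMeans(intensities[WINDOWINDS[[i]], , drop = FALSE])
    wind_ints
}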
POSSIBLE SOLUTIONS:
1) data.table

data.table is known to be fast at subsetting, but this page (and other pages related to sliding windows) suggests that this is not the case here. Indeed, the data.table code is elegant, but unfortunately very slow:
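The answer's original data.table code was not preserved here; this is a sketch in the same spirit (one group per window) using non-equi joins, which require data.table >= 1.9.8. doSlidingWindowDT is my own illustrative name.

library(data.table)

doSlidingWindowDT <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize / 2)
    DT <- data.table(coord = coords, as.data.table(intensities))
    valcols <- setdiff(names(DT), "coord")
    windows <- data.table(lo = coords - windHalfSize, hi = coords + windHalfSize)
    ## non-equi join: for each window, average every value column that falls inside it
    res <- DT[windows, lapply(.SD, mean),
              on = .(coord >= lo, coord <= hi),
              by = .EACHI, .SDcols = valcols]
    as.matrix(res[, ..valcols])
}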
2) foreach + doSNOW

The basic routine is easy to run in parallel, so we can benefit from that:
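The answer's original parallel code was not preserved here; this sketch shows the same idea: split the columns into one chunk per worker and run the serial routine on each chunk. doSlidingWindowParallel is my own illustrative name.

library(foreach)
library(doSNOW)   # doSNOW pulls in snow, which provides makeCluster()

doSlidingWindowParallel <- function(intensities, coords, windsize, ncores = 2) {
    cl <- makeCluster(ncores)
    registerDoSNOW(cl)
    on.exit(stopCluster(cl))
    chunkIds <- cut(seq_len(ncol(intensities)), ncores, labels = FALSE)
    chunks <- lapply(split(seq_len(ncol(intensities)), chunkIds),
                     function(cols) intensities[, cols, drop = FALSE])
    ## each worker smooths its own block of columns; cbind restores the original layout
    foreach(sub = chunks, .combine = cbind, .export = "doSlidingWindow") %dopar%
        doSlidingWindow(sub, coords, windsize)
}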
The benchmark shows a notable speed-up on my dual-core processor:
3) Rcpp
Yes, I know you asked "without going to C". But, please, take a look. This code is inline and rather straightforward:
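The answer's exact inline C++ code was not preserved here; this sketch uses Rcpp::cppFunction (which also compiles inline) and assumes coords is sorted in ascending order. slidingMeanCpp is my own illustrative name.

library(Rcpp)

cppFunction('
NumericMatrix slidingMeanCpp(NumericMatrix ints, IntegerVector coords, int windsize) {
    int n = ints.nrow(), m = ints.ncol();
    int half = (windsize + 1) / 2;          // same as ceiling(windsize/2) in the R version
    NumericMatrix out(n, m);
    NumericVector sums(m);                  // running column sums of the current window
    int lo = 0, hi = 0;                     // the current window covers rows [lo, hi)
    for (int i = 0; i < n; ++i) {
        // grow the window on the right while coordinates stay within coords[i] + half
        while (hi < n && coords[hi] <= coords[i] + half) {
            for (int j = 0; j < m; ++j) sums[j] += ints(hi, j);
            ++hi;
        }
        // shrink it on the left while coordinates fall below coords[i] - half
        while (coords[lo] < coords[i] - half) {
            for (int j = 0; j < m; ++j) sums[j] -= ints(lo, j);
            ++lo;
        }
        double cnt = hi - lo;
        for (int j = 0; j < m; ++j) out(i, j) = sums[j] / cnt;
    }
    return out;
}')

Because the window boundaries only ever move forward when coords is sorted, this pass is O(N * M) regardless of the window size.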
Benchmark:
I hope the results are quite motivating. While the data fits in memory, the Rcpp version is pretty fast. Say, with N <- 1e6 and M <- 100 I got:

Naturally, once R starts using swap everything slows down. With really large data that doesn't fit in memory you should consider sqldf, ff or bigmemory.

Rollapply works great with a small dataset. However, if you are working with several million rows (genomics) it is quite slow.
The following function is super fast:
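The answer's function itself was not preserved here; below is a hedged sketch of one fast, loop-free approach: column-wise cumulative sums plus findInterval() to locate each window's first and last row. fastSlidingMean is my own illustrative name; coords must be sorted in ascending order.

fastSlidingMean <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize / 2)
    ## first and last row index inside each window
    lo <- findInterval(coords - windHalfSize, coords, left.open = TRUE) + 1
    hi <- findInterval(coords + windHalfSize, coords)
    ## cumulative column sums with a leading row of zeros, so each window sum is two lookups
    cs <- rbind(0, apply(intensities, 2, cumsum))
    (cs[hi + 1, , drop = FALSE] - cs[lo, , drop = FALSE]) / (hi - lo + 1)
}

With sorted coordinates this needs only two vectorised lookups per row, so it should scale to millions of rows.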
Details here.