In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, then for each window width summarize the data by the maximum of each year. The example below will hopefully be clear.
I have tried several approaches, and the fastest up to now seems to be using roll_mean from the package RcppRoll for the running mean, and aggregate for picking the maximum.
Please note that memory requirement is a concern: the version below requires very little memory since it does one single rolling mean and aggregation at a time; this is preferred.
library(RcppRoll)
# Example data frame of 100k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata=rnorm(n),
                 year=sort(sample(2001:2014, size=n, replace=TRUE)))
ww <- 1:120 # Vector of window widths
# One row per year: first column holds the year, then one column per window width
dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))
system.time(for (i in seq_along(ww)) {
  # Do the rolling mean for this ww
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
  # Aggregate maxima for each year
  dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) # 28s on my machine
dfsumm
This gives the desired output: a data.frame with 14 rows (years from 2001 to 2014) and 121 columns (the year plus the 120 window widths), containing the maximum for each ww and for each year.
However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find something faster due to my lack of knowledge of those packages.
Which would be the fastest way to do this, using a single core (the code is already parallelized elsewhere)?
Memory management, i.e. allocation and copies, is killing you with your approach.
Here is a data.table approach, which assigns by reference:
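A minimal sketch of what such a by-reference approach could look like, assuming roll_mean from RcppRoll is kept for the rolling mean. The names dt, res and yearly_max are illustrative, and the data size and number of window widths are reduced so the example runs quickly:

```r
library(data.table)
library(RcppRoll)

set.seed(42)
n  <- 10000
dt <- data.table(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, size = n, replace = TRUE)))
ww <- 1:20  # window widths (shortened for illustration)

# One row per year; result columns are added by reference in the loop
res <- data.table(year = sort(unique(dt$year)))

for (w in ww) {
  # Overwrite the temporary column in place instead of copying the table
  dt[, tmp := roll_mean(rawdata, w, na.rm = TRUE, fill = NA)]
  # Yearly maxima, ignoring the NA padding at the edges of the series
  yearly_max <- dt[, max(tmp, na.rm = TRUE), keyby = year]$V1
  # Parentheses make data.table evaluate the expression as a column name
  res[, (paste0("D=", w)) := yearly_max]
}
```

Only one temporary column exists at any time, so the memory footprint stays close to that of the original loop. Note that max(..., na.rm = TRUE) returns -Inf with a warning if a year contains nothing but NA padding, which can happen when a window is wider than the data available for that year.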
Using the new frollmean function (added in data.table v1.12.0), you can do the rolling mean directly in data.table.
You should consider shifting your parallelism down, as this use case is well parallelized in frollmean. The grouping operation also uses parallel processing.
One performance issue you create is dynamically growing a vector using cbind. You could try allocating the expected size beforehand and later populating it using dfsumm[x] <- y.
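The frollmean and preallocation suggestions above could be combined roughly as follows. This is a sketch under assumptions, not the answerers' original code: the names dt, out and years are illustrative, the sizes are reduced, and align = "center" is chosen to mimic the default of roll_mean (frollmean aligns to the right by default, and the two packages may treat even window widths differently):

```r
library(data.table)

set.seed(42)
n  <- 10000
dt <- data.table(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, size = n, replace = TRUE)))
ww    <- 1:20  # window widths (shortened for illustration)
years <- sort(unique(dt$year))

# Preallocate the full result once instead of growing it with cbind()
out <- matrix(NA_real_, nrow = length(years), ncol = length(ww),
              dimnames = list(as.character(years), paste0("D=", ww)))

for (k in seq_along(ww)) {
  w <- ww[k]
  # Rolling mean for one window width, written into dt by reference
  dt[, tmp := frollmean(rawdata, w, align = "center", na.rm = TRUE)]
  # Fill the preallocated column with the yearly maxima
  out[, k] <- dt[, max(tmp, na.rm = TRUE), keyby = year]$V1
}
```

frollmean also accepts a vector of window widths and then returns a list with one result per width. That avoids the loop but holds all rolling means in memory at once, so the one-width-at-a-time loop is likely the better fit when memory is a concern.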