Fast rolling mean + summarize

2019-05-08 13:49发布

问题:

In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, then for each window width summarize the data by the maximum of each year. The example below will hopefully be clear. I have tried several approaches, and the fastest up to now seems to be using roll_mean from the package RcppRoll for the running mean, and aggregate for picking the maximum. Please note that memory requirement is a concern: the version below requires very little memory since it does one single rolling mean and aggregation at a time; this is preferred.

#Example data frame of 10k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata=rnorm(n),
                 year=sort(sample(2001:2014, size=n, replace=TRUE))
                 ) 

ww <- 1:120 #Vector of window widths

dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))

system.time(for (i in 1:length(ww)) {
  #Do the rolling mean for this ww
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
  #Aggregate maxima for each year
  dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) #28s on my machine
dfsumm

This gives the desired output: a data.frame with 15 rows (years from 2001 to 2015) and 120 columns (the window widths) containing the maximum for each ww and for each year.

However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find something faster due to my lack of knowledge of those packages.

Which would be the fastest way to do this, using a single core (the code is already parallelized elsewhere)?

回答1:

Memory management, i.e. allocation and copies, is killing you with your approach.

Here is a data.table approach, which assigns by reference:

library(data.table)
setDT(df)
alloc.col(df, 200) #allocate sufficient columns

#assign rolling means in a loop
for (i in seq_along(ww)) 
  set(df, j = paste0("D", i),  value = roll_mean(df[["rawdata"]], 
                                        ww[i], na.rm=TRUE, fill=NA))

dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year] #aggregate


回答2:

Using new frollmean function (added in data.table v1.12.0) you can do the following

th = setDTthreads(1L)
df[, paste0("D",ww) := frollmean(rawdata, ww, na.rm=TRUE)]
dfsumm <- df[, lapply(.SD, max, na.rm=TRUE), by=year]
setDTthreads(th)

You should consider shifting your parallelism down, as this use case is well parallelized in frollmean. Also grouping operation is utilizing parallel processing.



回答3:

One performance issue you create is using dynamically growing a vector using cbind. You could try to allocate the expected size beforehand, and later populating it using dfsumm[x] <- y.