Fast rolling mean + summarize

In R, I am trying to compute a very fast rolling mean of a large vector (up to 400k elements) for several window widths and then, for each window width, summarize the result by taking the maximum within each year. The example below should make this clear. I have tried several approaches; the fastest so far uses roll_mean from the RcppRoll package for the rolling mean and aggregate for picking the maximum. Note that memory is a concern: the version below needs very little memory because it performs one rolling mean and one aggregation at a time, and this is preferred.

library(RcppRoll) #provides roll_mean

#Example data frame of 100k measurements from 2001 to 2014
n <- 100000
df <- data.frame(rawdata=rnorm(n),
                 year=sort(sample(2001:2014, size=n, replace=TRUE))
                 )

ww <- 1:120 #Vector of window widths

dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
dfsumm[,1] <- 2001:2014
colnames(dfsumm) <- c("year", paste0("D=", ww))

system.time(for (i in seq_along(ww)) {
  #Do the rolling mean for this ww
  df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
  #Aggregate maxima for each year
  dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
}) #28s on my machine
dfsumm

This gives the desired output: a data.frame with 14 rows (years from 2001 to 2014) and 121 columns (the year plus one column per window width) containing the maximum for each ww and for each year.

However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find something faster due to my lack of knowledge of those packages.

What would be the fastest way to do this using a single core (the code is already parallelized elsewhere)?

Tags: r dataframe data.table aggregate rolling-computation

3 Answers

趁早两清 (reply #2, 2019-05-08 14:25)

Memory management, i.e. allocation and copies, is killing you with your approach.

Here is a data.table approach, which assigns by reference:

library(data.table)
library(RcppRoll) #provides roll_mean
setDT(df)
alloc.col(df, 200) #allocate sufficient columns

#assign rolling means in a loop
for (i in seq_along(ww)) 
  set(df, j = paste0("D", i),  value = roll_mean(df[["rawdata"]], 
                                        ww[i], na.rm=TRUE, fill=NA))

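#aggregate only the rolling-mean columns, so rawdata is not summarized too
dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year,
             .SDcols = paste0("D", seq_along(ww))]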


啃猪蹄的小仙女 (reply #3, 2019-05-08 14:32)

Using the new frollmean function (added in data.table v1.12.0), you can do the following:

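library(data.table)
setDT(df) #convert to data.table by reference
th = setDTthreads(1L) #use a single thread; th keeps the previous setting
df[, paste0("D",ww) := frollmean(rawdata, ww, na.rm=TRUE)]
dfsumm <- df[, lapply(.SD, max, na.rm=TRUE), by=year, .SDcols=paste0("D",ww)]
setDTthreads(th) #restore the previous thread setting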

You should consider moving your parallelism down to this level, since frollmean parallelizes this use case well. The grouping operation also uses parallel processing.
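One detail to watch when switching from roll_mean to frollmean: RcppRoll::roll_mean defaults to align = "center", while data.table::frollmean defaults to align = "right", so the two place each window differently unless align is set explicitly. The following is a minimal sketch on a small, made-up dataset (dt, cols and dtsumm are illustrative names, and the reduced n and ww are assumptions to keep it quick). It assumes data.table v1.12.0 or later and passes align = "center" to stay closer to the roll_mean default; results for even window widths may still not match roll_mean exactly.

```r
library(data.table)

set.seed(1)
n  <- 1000                       # small illustrative size, not the 100k of the question
dt <- data.table(rawdata = rnorm(n),
                 year    = sort(sample(2001:2014, size = n, replace = TRUE)))
ww   <- 1:5                      # a few window widths for illustration
cols <- paste0("D", ww)

# One call computes all window widths; frollmean returns a list with one
# vector per width, which := assigns to the new columns by reference.
dt[, (cols) := frollmean(rawdata, ww, align = "center", na.rm = TRUE)]

# Yearly maximum of each rolling-mean column only (rawdata is excluded).
dtsumm <- dt[, lapply(.SD, max, na.rm = TRUE), by = year, .SDcols = cols]
```

Here dtsumm has one row per year and one column per window width, plus the year column.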


戒情不戒烟 (reply #4, 2019-05-08 14:43)

One performance issue is dynamically growing an object with cbind. You could try allocating the expected size beforehand and then populating it with dfsumm[x] <- y.
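The preallocate-then-fill pattern can be sketched in base R as follows. This is only an illustration under stated assumptions: roll_mean_base is a hypothetical helper built on stats::filter (a swapped-in stand-in for roll_mean, so its handling of even window widths may differ), the data are small and made up, and a numeric matrix is used as the preallocated container instead of a data.frame.

```r
set.seed(1)
n       <- 1000                  # small illustrative size
rawdata <- rnorm(n)
year    <- sort(sample(2001:2014, size = n, replace = TRUE))
ww      <- 1:5                   # a few window widths for illustration

# Hypothetical helper: centered rolling mean in base R.
# Positions where the window is incomplete are NA.
roll_mean_base <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# Allocate the full result once: one row per year, one column per width.
years <- sort(unique(year))
res <- matrix(NA_real_, nrow = length(years), ncol = length(ww),
              dimnames = list(as.character(years), paste0("D=", ww)))

# Fill one column per iteration; nothing grows inside the loop.
for (i in seq_along(ww)) {
  rm_i <- roll_mean_base(rawdata, ww[i])
  res[, i] <- tapply(rm_i, year, max, na.rm = TRUE)
}
```

Only one rolling-mean vector exists at a time, so this keeps the low memory footprint the question asks for.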
