using data.table to speed up rollapply

I am new to data.tables so apologies if this is a very basic question.

I have heard that data.tables significantly improves computational times when working with large amounts data, and so would like to see if data.table is able to help in speeding up the rollapply function.

if we have some univariate data

xts.obj <- xts(rnorm(1e6), order.by=as.POSIXct(Sys.time()-1e6:1), tz="GMT") 
colnames(xts.obj) <- "rtns"

a simple rolling quantile with width of 100 and a p of 0.75 takes a surprisingly long time...

i.e. the line of code

xts.obj$quant.75 <- rollapply(xts.obj$rtns,width=100, FUN='quantile', p=0.75)

seems to take forever...

is there anything that data.table can do to speed things up? i.e. is there a generic roll function that can be applied?

perhaps a routine to convert an xts object to a data.table object to carry out the function in a speeded up manner and then reconvert back to xts at the end?

thanks in advance

hlm

p.s. I didn't seem to be getting much of a response on the data.table mailing list so am posting up here, to see if I get a better response.

p.p.s having a quick go with another example using dataframes the data.table solution seems to take longer than the rollapply function, i.e. shown below:

> x <- data.frame(x=rnorm(10000))
> x.dt <- data.table(x)
> system.time(l1 <- as.numeric(rollapply(x,width=10,FUN=quantile,p=0.75)))   
   user  system elapsed 
   2.69    0.00    2.68 
> system.time(l <- as.numeric(unlist(x.dt[,lapply(1:((nrow(x.dt))-10+1), function(i){ x.dt[i:(i+10-1),quantile(x,p=0.75)]})])))
   user  system elapsed 
  11.22    0.00   11.51 
> identical(l,l1)
[1] TRUE

标签： r dataframe data.table apply xts

2条回答

劫难

2楼-- · 2019-04-28 12:42

data.table works quickly by splitting the data by a key. I don't think data.table currently supports a rolling key, or an expression that in the by or i arguments that would do this.

You could use the fact that subsetting is faster for data.table than a data.frame

DT <- as.data.table(x)
.x <- 1:(nrow(DT)-9)
system.time(.xl <- unlist(lapply(.x, function(.i) DT[.i:(.i+10),quantile(x,0.75, na.rm = T)])))
   user  system elapsed 
   8.77    0.00    8.77

Or you could construct key variables that will uniquely identify the rolling ids. Width = 10, therefore we need 10 columns (padded with NA_real_)

library(plyr) # for as.quoted
.j <- paste0('x',1:10, ':= c(rep(NA_real_,',0:9,'),rep(seq(',1:10,',9991,by=10),each=10), rep(NA_real_,',c(0,9:1),'))')


 datatable <- function(){
   invisible(lapply(.j, function(.jc) x.dt[,eval(as.quoted(.jc)[[1]])]))
x_roll <- rbind(x.dt[!is.na(x1),quantile(x,0.75),by=x1],
  x.dt[!is.na(x2),quantile(x,0.75),by=x2],
  x.dt[!is.na(x3),quantile(x,0.75),by=x3],
  x.dt[!is.na(x4),quantile(x,0.75),by=x4],
      x.dt[!is.na(x5),quantile(x,0.75),by=x5],
      x.dt[!is.na(x6),quantile(x,0.75),by=x6],
      x.dt[!is.na(x7),quantile(x,0.75),by=x7],
      x.dt[!is.na(x8),quantile(x,0.75),by=x8],
      x.dt[!is.na(x9),quantile(x,0.75),by=x9],
      x.dt[!is.na(x10),quantile(x,0.75),by=x10],use.names =F)


setkeyv(x_roll,'x1')

invisible(x.dt[,x1:= 1:10000])
setkeyv(x.dt,'x1')
x_roll[x.dt][, list(x,V1)]}

l1 <- function()as.numeric(rollapply(x,width=10,FUN=quantile,p=0.75))
lapply_only <- function() unclass(lapply(1:(nrow(x) - 9), function(i) quantile(x[['x']][i:(i + 9)], p=0.75)))


benchmark(datatable(),l1(),lapply_only(), replications = 5)
##            test replications elapsed relative user.self 
## 1   datatable()            5    9.41 1.000000      9.40      
## 2          l1()            5   10.97 1.165781     10.85        
## 3 lapply_only()            5   10.39 1.104145     10.35

EDIT --- benchmarking

data.table is quicker than rollapply and raw lapply. I can't test the parallel solution.

0人赞添加讨论(0) 举报

疯言疯语

3楼-- · 2019-04-28 12:46

datatable is quite irrelevant here - you're essentially running sapply on a vector, that is pretty much the fastest operation you can get (other than going to C). data frames and data tables will always be slower than vectors. You can gain a bit by using a straight vector (without xts dispatch), but the only easy way to get this done quickly is to parallelize:

> x = as.vector(xts.obj$rtns)
> system.time(unclass(mclapply(1:(length(x) - 99),
                      function(i) quantile(x[i:(i + 99)], p=0.75), mc.cores=32)))
   user  system elapsed 
325.481  15.533  11.221

If you need that even faster, then you may want to write a specialized function: the naive apply approach re-sorts every chunk which is obviously wasteful - all you need to do is to drop the one element and sort in the next one to obtain the quantile so you can expect roughly 50x speedup if you do that - but you'll have to code that yourself (so it's only worth if you use it more often ...).

0人赞添加讨论(0) 举报

using data.table to speed up rollapply

EDIT --- benchmarking

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间