
data.table time subset vs xts time subset

Posted 2019-02-13 22:29

Question:

Hi, I am looking to subset some minutely data by time. I normally use xts, doing something like:

subset.string <- 'T10:00/T13:00' 
xts.min.obj[subset.string]

to get all the rows that fall between 10am and 1pm (inclusive) EACH DAY, with the output kept in xts format. But it is a bit slow for my purposes, e.g.

j <- xts(rnorm(10e6),Sys.time()-(10e6:1))
system.time(j['T10:00/T16:00'])
   user  system elapsed 
  5.704   0.577  17.115 

I know that data.table is very fast at subsetting large datasets, so I am wondering whether, in conjunction with the fasttime package for fast POSIXct creation, it would be worth it to create a function like

dt.time.subset <- function(xts.min.obj, subset.string){
  require(data.table)
  require(fasttime)
  x.dt <- data.table(ts = format(index(xts.min.obj), "%Y-%m-%d %H:%M:%S %Z"),
                     coredata(xts.min.obj))
  out <- x.dt[, some.subsetting.operation.using.`%between%`]  # placeholder
  xts(out, fastPOSIXct(out[, ts]))
}

to convert the xts.min.obj into a data.table, add some sort of character index, use data.table to subset the relevant rows, and then use the output row index with fasttime to recreate an xts output? Or is this too many excess operations for something that is already highly optimised and written in C?
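
For concreteness, here is one way such a function might look. This is only a sketch, not a benchmarked implementation: the from/to arguments and the tod helper column are illustrative names I've introduced, %between% is applied to fixed-width "HH:MM:SS" strings (which compare correctly because they sort lexicographically), and fastPOSIXct() parses text as UTC, so the round-trip is only exact for UTC data.

require(xts)
require(data.table)
require(fasttime)

dt.time.subset <- function(xts.min.obj, from = "10:00:00", to = "13:00:00"){
  # keep a full character timestamp for rebuilding the index with fastPOSIXct(),
  # plus a time-of-day column to subset on
  x.dt <- data.table(ts  = format(index(xts.min.obj), "%Y-%m-%d %H:%M:%S"),
                     tod = format(index(xts.min.obj), "%H:%M:%S"),
                     coredata(xts.min.obj))
  # fixed-width "HH:MM:SS" strings sort lexicographically, so %between% works
  out <- x.dt[tod %between% c(from, to)]
  # drop the helper columns and rebuild the xts object
  xts(as.matrix(out[, !c("ts", "tod")]), fastPOSIXct(out$ts))
}

dt.time.subset(j, "10:00:00", "13:00:00")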

Answer 1:

If you're ok with specifying your range in UTC, you can do:

j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)]
# +60 because xts includes that minute; you'll need to offset the times
# appropriately to match with xts unless you live in UTC :)

j <- xts(rnorm(10e6),Sys.time()-(10e6:1))
system.time(j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)])
#  user  system elapsed 
#  1.17    0.08    1.25 
# likely faster on your machine as mine takes minutes to run the OP bench
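
If your data is not in UTC, one option is to shift the epoch seconds by your UTC offset before taking the remainder. A minimal sketch, assuming a single fixed offset across the whole series (data spanning a DST change would need per-day offsets); note that the gmtoff field of POSIXlt is optional and only filled in on platforms that expose it:

off <- as.POSIXlt(Sys.time())$gmtoff  # local offset from UTC, in seconds
j[((.index(j) + off) %% 86400) %between% c(10*3600, 16*3600 + 60)]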