Data.table: how to get the blazingly fast subsets

Posted 2019-02-19 19:21

Question:

I'm trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medication available for implementing the prescribed regimen. I have a function that returns the relevant result, but it runs for days on just a subset of the total data I have to run it on.

The datasets are:

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I'm used to working with dplyr, but it was much slower, so I rewrote things in data.table to try it out. It is driving me crazy that my colleagues working with SAS claim that this wouldn't take long for them, when it takes me hours just to load the data itself into RAM (fread crashes R on several of my datasets). adherence is 1.5 million rows, and lsr is a few hundred million rows.

My working function is

function.AH <- function(x) {
  lsr[ID == x[1] & eksd <= x[2] & ENDDATE > x[2], ifelse(.N == 0, 0, sum(as.numeric(ENDDATE - as.Date(x[2]))))]
}
setkey(lsr, ID, eksd, ENDDATE)
adherence$AH <- apply(adherence, 1, FUN = function.AH) # DESIRED OUTPUT

I don't know the best approach: I've looked into using a SQL database, but as I understand it that shouldn't be faster when my data fits into RAM (I have 256 GB). Since the adherence data.table is actually each individual ID repeated for 500 time periods (i.e. ID 1: at time 1, time 2, time 3 ... time 500, ID 2: at time 1, time 2 ... etc.), I also considered grouping lsr by ID and somehow embedding this time interval (1:500) in the function in j.

I hope that someone can point out how I'm using the apply function inefficiently by not applying it inside the data.table framework, thus losing the built-in efficiency. But as I'm going to be working with this data and similar sizes of data, I'd appreciate any specific suggestions for solving this faster, or general suggestions for getting faster running times using other methods.
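
For reference, a rough sketch (not part of the original question) of the grouped approach contemplated above: group lsr by ID and evaluate all time points inside j at once. The helper vector timepoints is an assumption standing in for the 500 periods; the answers below give better solutions.

timepoints <- sort(unique(adherence$year))   # stand-in for the 500 time periods
ah_by_id <- lsr[, .(year = timepoints,
                    AH = sapply(seq_along(timepoints), function(i) {
                      tp <- timepoints[i]
                      # sum of remaining supply over prescriptions covering tp;
                      # sum() over an empty selection returns 0
                      sum(as.numeric(ENDDATE - tp)[eksd <= tp & ENDDATE > tp])
                    })),
                by = ID]
# ah_by_id can then be joined back onto adherence on (ID, year)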

Answer 1:

This can be solved by updating in a non-equi join.

This avoids the memory issues caused by a cartesian join or by calling apply(), which coerces the data.frame or data.table to a matrix and thereby copies the data.

In addition, the OP has mentioned that lsr has a few hundred million rows and adherence has 1.5 million rows (500 time periods times 3,000 IDs). Therefore, efficient storage of the data items will not only reduce the memory footprint but may also reduce the share of processing time required for loading the data.

library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE), 
                      AH := as.integer(ENDDATE - x.year)][]
   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63

Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].
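
For illustration, a minimal sketch of that variant (the same non-equi join as above, just with AH pre-initialised so that unmatched rows show 0 instead of NA):

# initialise AH before the join
adherence[, AH := 0L]
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]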

Data

The code to create the sample datasets can be streamlined:

adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"), 
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
  stringsAsFactors = FALSE)

lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD

Note that DDD is of type integer, which requires 4 bytes per value instead of the 8 bytes used by type numeric/double.
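
As a quick illustration (not from the original answer), the storage difference is easy to see with object.size():

object.size(integer(1e6))  # one million integers: ~4 MB
object.size(double(1e6))   # one million doubles:  ~8 MB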

Also note that the last statement of the data code above, lsr$ENDDATE <- lsr$eksd + lsr$DDD, may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax, which updates by reference.

library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]


Answer 2:

I am not sure why your function is slow (I think you could remove the ifelse(), since sum() over an empty selection already returns 0), but I would propose using a merge, so that you operate on one table only, which should be faster:

# full merge in one go (may need a lot of memory if the cartesian product is large)
plouf <- lsr[adherence, on = "ID", allow.cartesian = TRUE]
plouf[, year := as.Date(year)]

# alternatively, loop over the years to keep the intermediate tables small
bob <- rbindlist(lapply(unique(adherence$year), function(x) {
  plouf <- lsr[adherence[year == x], on = "ID"]
  plouf[, year := as.Date(year)]
  plouf[year >= eksd & year < ENDDATE,
        list(sum = sum(as.numeric(ENDDATE - as.Date(year))), year = year), by = ID]
}))
bob

   ID sum       year
1:  1  64 2013-02-01
2:  3  63 2013-02-01

You can then merge the result back onto adherence:

adherence <- setDT(adherence)
adherence[,year := as.Date(year)]
bob[adherence, on = .(ID,year)]
   ID sum       year
1:  1  NA 2013-01-01
2:  2  NA 2013-01-01
3:  3  NA 2013-01-01
4:  1  64 2013-02-01
5:  2  NA 2013-02-01
6:  3  63 2013-02-01

For reading your data, use the fread() function, which is fast for big data.
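
For example, a minimal sketch (the file name and column list are placeholders for your actual data): reading only the columns you need and fixing their types up front keeps the memory footprint down.

library(data.table)
# "lsr.csv" and its column names are assumptions; adjust to your file
lsr <- fread("lsr.csv",
             select     = c("ID", "eksd", "DDD"),
             colClasses = list(character = c("ID", "eksd"), integer = "DDD"))
lsr[, eksd := as.IDate(eksd)]            # compact integer-backed date class
lsr[, ENDDATE := as.IDate(eksd + DDD)]   # update by reference, no copy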