data.table syntax for split-apply-combine à la plyr

Posted 2019-08-10 21:40

Question:

I'm just starting to learn data.table and am working my way through the vignettes, although I'm already using it in a project. How do I replace some plyr syntax with data.table?

library(data.table)

input <- data.table(ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
                    b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21),
                    c1 = c(1, 2, 3), c2 = c(-0.94, 2.2, -1.223))

# simple user-defined function that conveys my problem
func <- function(x, num) {
  x <- data.table(x)
  new_b <- x$b1[1]
  # take the first row and overwrite b1 and b2
  x2 <- within(x[1, ], {
    b1 = new_b
    b2 = 51
  })
  # append num copies of the modified row to the original rows
  imp <- rbindlist(replicate(num, x2, simplify = FALSE))
  return(rbindlist(list(x, imp)))
}

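# wrapper function: apply func to each ID group via plyr or data.table
wrap_func <- function(dat, num = 5, plyr = FALSE) {
  if (isTRUE(plyr)) {
    return(plyr::ddply(dat, .variables = "ID", .fun = func, num = num))
  } else {
    # my data.table attempt; this is the call that errors
    return(dat[, lapply(.SD, FUN = func, num), by = ID])
  }
}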

The plyr version works:

wrap_func(dat = input, num = 5, plyr = TRUE)

What is the equivalent data.table syntax?

wrap_func(dat=input, num=5, plyr=FALSE) # gives error
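
One possible direction, offered as a sketch rather than a definitive answer: lapply(.SD, ...) calls the function once per column of each group, whereas ddply calls it once per group with the whole subset. Passing .SD itself to the function should mirror the plyr behaviour. The helper name func_dt below is hypothetical; it restates func (first row copied, b2 set to 51) so that the block runs on its own.

```r
library(data.table)

input <- data.table(ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
                    b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21),
                    c1 = c(1, 2, 3), c2 = c(-0.94, 2.2, -1.223))

# Hypothetical restatement of func: keep the group's rows and append
# num copies of its first row with b2 replaced by 51.
func_dt <- function(x, num) {
  x <- as.data.table(x)
  x2 <- copy(x[1, ])      # work on a copy so the group subset is never modified
  x2[, b2 := 51]
  imp <- rbindlist(replicate(num, x2, simplify = FALSE))
  rbindlist(list(x, imp))
}

# Call the function once per ID group, passing the whole subset (.SD)
res <- input[, func_dt(.SD, num = 5), by = ID]
nrow(res)  # 18: each ID keeps its original row plus 5 imputed rows
```

Inside `by = ID`, .SD holds every column except ID, and data.table adds the ID column back to the combined result.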

Thanks in advance!!

Update:

Based on @Frank's suggestion in the comments, I benchmarked this on my real data and code. Here, impute_zero_resp_all is the real equivalent of wrap_func using data.table, and impute_zero_resp_all2 is the plyr version. Timings are elapsed seconds.

I start with a dataset that has ~50k rows and 1800 groups; imputation is done by group, resulting in a dataset with ~170k rows and the same 1800 groups:

# elapsed time in seconds over 50 runs of each implementation
vec1 <- vec2 <- vector(mode = "numeric", length = 50)
for (i in seq_len(50)) {
  vec1[i] <- system.time(impute_zero_resp_all(dat = test_dat2))["elapsed"]  # data.table
  vec2[i] <- system.time(impute_zero_resp_all2(dat = test_dat2))["elapsed"] # plyr
}

summary(vec1) # data.table
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.62   22.76   22.81   22.84   22.84   23.72 
summary(vec2) # plyr
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.19   27.35   27.40   27.49   27.45   30.07

quantile(vec1, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
22.620 22.670 22.728 22.760 22.786 22.810 22.824 22.840 22.870 22.917 23.720 
quantile(vec2, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
27.190 27.289 27.330 27.357 27.376 27.400 27.424 27.440 27.476 27.522 30.070

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1