split apply recombine, plyr, data.table in R

2019-02-03 18:54发布

I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The applying I am doing is running a regression for each firm and returning the residuals, therefore, I am not aggregating by firm. plyr is great for this but it takes a very very long time to run when the number of firms is large. Is there a way to do this with data.table?

Sample Data:

dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21

I need to split by each id (namely 1 and 2). Run a regression, return the residuals and append it as a column to my data. Is there a way to do this using data.table?

2条回答
Root(大扎)
2楼-- · 2019-02-03 19:14

DWin's answer is correct for v1.8.0 (as currently on CRAN). But in v1.8.1 (on R-Forge repository), := now works by group. It works for non-contiguous groups too so there is no need to setkey first for it to line up.

dtb <- as.data.table(dat)
dtb
           dte id val1 val2
1:  2001-10-02  1   10   25
2:  2001-10-03  1   11   24
3:  2001-10-04  1   12   23
4:  2001-10-02  2   13   22
5:  2001-10-03  2   14   21
dtb[, resid:=residuals(lm(val1 ~ val2)), by=id]
           dte id val1 val2         resid
1:  2001-10-02  1   10   25  1.631688e-15
2:  2001-10-03  1   11   24 -3.263376e-15
3:  2001-10-04  1   12   23  1.631688e-15
4:  2001-10-02  2   13   22  0.000000e+00
5:  2001-10-03  2   14   21  0.000000e+00

To upgrade to v1.8.1 just install from the R-Forge repo. (R 2.15.0+ is needed when installing any binary package from R-Forge) :

install.packages("data.table", repos="http://R-Forge.R-project.org")

or install from source if you can't upgrade to latest R. data.table itself only needs R 2.12.0+.

Extending to the 1MM case :

DT = data.table(dte=Sys.Date()+1:1000000, 
                id=sample(1:2, 1000000, repl=TRUE),
                val1=runif(1000000),  val2=runif(1000000) )
setkey(DT, id)
system.time(ans1 <- cbind(DT, DT[, residuals(lm(val1 ~ val2)), by="id"]) )
   user  system elapsed 
 12.272   0.872  13.182 
ans1
                dte id      val1       val2 id           V1
      1: 2012-07-02  1 0.8369147 0.57553383  1  0.336647598
      2: 2012-07-05  1 0.0109102 0.02532214  1 -0.488633325
      3: 2012-07-06  1 0.4977762 0.16607786  1 -0.001952414
     ---                                                   
 999998: 4750-05-27  2 0.1296722 0.62645838  2 -0.370627034
 999999: 4750-05-28  2 0.2686352 0.04890710  2 -0.231952238
1000000: 4750-05-29  2 0.9981029 0.91626787  2  0.497948275

system.time(DT[, resid:=residuals(lm(val1 ~ val2)), by=id])
   user  system elapsed 
  7.436   0.648   8.107 
DT
                dte id      val1       val2        resid
      1: 2012-07-02  1 0.8369147 0.57553383  0.336647598
      2: 2012-07-05  1 0.0109102 0.02532214 -0.488633325
      3: 2012-07-06  1 0.4977762 0.16607786 -0.001952414
     ---                                                
 999998: 4750-05-27  2 0.1296722 0.62645838 -0.370627034
 999999: 4750-05-28  2 0.2686352 0.04890710 -0.231952238
1000000: 4750-05-29  2 0.9981029 0.91626787  0.497948275

The example above only has 2 groups, is quite small at under 40MB, and Rprof shows 96% of the time is spent in lm. So in these cases := by group is not for a speed advantage really, but more for the convenience; i.e., less code needed to write and no superfluous columns added to the output. As size grows, the avoidance of copies comes into it and speed advantages start to show. Especially, transform in j will slow down terribly as the number of groups increases.

查看更多
Ridiculous、
3楼-- · 2019-02-03 19:26

I'm guessing this needs to be sorted by "id" to line up properly. Luckily that happens automatically when you set the key:

dat <-read.table(text="dte, id, val1, val2
 2001-10-02, 1, 10, 25
 2001-10-03, 1, 11, 24
 2001-10-04, 1, 12, 23
 2001-10-02, 2, 13, 22
 2001-10-03, 2, 14, 21
 ", header=TRUE, sep=",")
 dtb <- data.table(dat)
 setkey(dtb, "id")
 dtb[, residuals(lm(val1 ~ val2)), by="id"]
#---------------
cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
#---------------
            dte id val1 val2 id.1            V1
[1,] 2001-10-02  1   10   25    1  1.631688e-15
[2,] 2001-10-03  1   11   24    1 -3.263376e-15
[3,] 2001-10-04  1   12   23    1  1.631688e-15
[4,] 2001-10-02  2   13   22    2  0.000000e+00
[5,] 2001-10-03  2   14   21    2  0.000000e+00



> dat <- data.frame(dte=Sys.Date()+1:1000000, 
                    id=sample(1:2, 1000000, repl=TRUE),  
                    val1=runif(1000000),  val2=runif(1000000) )
> dtb <- data.table(dat)
> setkey(dtb, "id")
> system.time(  cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
   user  system elapsed 
  1.696   0.798   2.466 
> system.time( dtb[,transform(.SD,r = residuals(lm(val1~val2))),by = "id"] )
   user  system elapsed 
  1.757   0.908   2.690 

EDIT from Matthew : This is all correct for v1.8.0 on CRAN. With the small addition that transform in j is the subject of data.table wiki point 2: "For speed don't transform() by group, cbind() afterwards". But, := now works by group in v1.8.1 and is both simple and fast. See my answer for illustration (but no need to vote for it).

Well, I voted for it. Here is the console command to install v 1.8.1on a Mac (if you have the proper XCode tools avaialble, since it only there in source):

install.packages("data.table", repos= "http://R-Forge.R-project.org", type="source", 
               lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")

(For some reason I could not get the Mac GUI Package Installer to read r-forge as a repository.)

查看更多
登录 后发表回答