Block bootstrap from subject list

Posted 2020-02-13 10:00

Question:

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.

I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.

  • Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
  • Using a loop-and-subset approach, like the one below, seems computationally burdensome.
  • Note that for my real problem, the data frame, n, and the number of iterations are all much larger than in the example below.

My initial thought is to break the existing data frame into a list by subject using the split() command. From there, use

sample(unique(df1$subject),n,replace=TRUE)

to get the new list, then perhaps use quickdf from the plyr package to construct a new data frame.
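
For concreteness, a rough sketch of that split/sample/recombine idea (using do.call(rbind, ...) in place of quickdf, and assuming the Grunfeld data loaded in the example below) might look like:

## split the panel into a list of per-firm data frames
firm_list <- split(Grunfeld, Grunfeld$firm)
## sample 10 firms with replacement, then stack them back into one data frame
sampled_ids <- sample(names(firm_list), 10, replace = TRUE)
newdata <- do.call(rbind, firm_list[sampled_ids])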

Example slow code:

require(plm)
data("Grunfeld", package="plm")

firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()

for(j in 1:iterations){

  v = sample(length(firms),n,replace=TRUE)
  newdata = NULL

  for(i in 1:n){
    newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
  }

  reg1 = lm(value ~ inv + capital, data = newdata)
  mybootresults[[j]] = coefficients(reg1)

}

mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults

  (Intercept)      inv    capital
1    373.8591 6.981309 -0.9801547
2    370.6743 6.633642 -1.4526338
3    528.8436 6.960226 -1.1597901
4    331.6979 6.239426 -1.0349230
5    507.7339 8.924227 -2.8661479
...
...

Answer 1:

How about something like this:

library(boot)

## x is the vector of firm ids; i is the vector of (possibly repeated)
## indices that boot() draws for each replicate
myfit <- function(x, i) {
   mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm == x[n])))
   coefficients(lm(value ~ inv + capital, data = mydata))
}

firms <- unique(Grunfeld$firm)

b0 <- boot(firms, myfit, R = 999)
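
If interval estimates are also wanted, the replicates stored in b0 can then be summarized with boot.ci(); for instance, a percentile interval for the inv coefficient (the second element of the returned vector):

## percentile CI for the second coefficient (inv)
boot.ci(b0, type = "perc", index = 2)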


Answer 2:

You can also use the tsboot function in the boot package with a fixed-block resampling scheme.

require(plm)
require(boot)
data(Grunfeld)

### each firm is of length 20
table(Grunfeld$firm)
##  1  2  3  4  5  6  7  8  9 10 
## 20 20 20 20 20 20 20 20 20 20


blockboot <- function(data) {
  coefficients(lm(value ~ inv + capital, data = data))
}

### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")

boot.1    
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 410.81557 -25.785972    174.3766
## t2*   5.75981   0.451810      2.0261
## t3*  -0.61527   0.065322      0.6330

dim(boot.1$t)
## [1] 99  3

head(boot.1$t)
##        [,1]   [,2]      [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283  0.031324
## [3,] 479.74 3.2531  0.637298
## [4,] 557.79 4.5284  0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
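
If the bootstrap distribution of each coefficient is what's needed, the replicates in boot.1$t can also be summarized directly, for example:

## e.g. 95% percentile intervals, one column per coefficient
apply(boot.1$t, 2, quantile, probs = c(0.025, 0.975))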


Answer 3:

The solution above needs to be modified to handle fixed effects.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")

## Create a data frame with one row per firm; the extra "junk" column keeps
## X[i, ] a data frame (rather than a vector) inside myfit() below
firms <- data.frame(firm=unique(Grunfeld$firm), junk=1)

## for boot(), X is the firms data frame; i indexes the sampled firms
myfit <- function(X, i) {
    ## join the sampled firms to their firm-year data
    mydata <- left_join(X[i,], Grunfeld, by="firm")
    ## Distinguish between multiple resamples of the same firm
    ## Otherwise they have the same id in the fixed effects regression
    ## And trouble ensues
    mydata  <- mutate(group_by(mydata,firm,year),
                      firm_uniq4boot = paste(firm,"+",row_number())
                      )
    ## Run regressions with and without firm fixed effects
    c(coefficients(lm(value ~ inv + capital, data = mydata)),
      coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}

set.seed(1)
system.time(b <- boot(firms, myfit, 1000))

summary(b)

summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))


Answer 4:

Here is a method that should typically be faster than the accepted answer, returns the same results, and does not rely on additional packages (except boot). The key is to use which() and integer indexing to construct each data.frame replicate, rather than split/subset and do.call/rbind.

# statistic function for boot()
myIndex <- function(x, i) {
  # select the observations to subset. Likely repeated observations
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # run regression for given replicate, return estimated coefficients
  coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}

Now, bootstrap:

# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)

Run the accepted answer:

set.seed(1234)
b0 <- boot(firms, myfit, 200)

Let's eyeball a comparison.

Using indexing:

b1

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myIndex, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

Original version:

b0

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myfit, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

These look pretty close. Now, a bit more checking:

identical(b0$t, b1$t)
[1] TRUE

and

identical(summary(b0), summary(b1))
[1] TRUE

Finally, we'll do a quick benchmark:

library(microbenchmark)
microbenchmark(index = {b1 <- boot(firms, myIndex, 200)}, 
               rbind = {b0 <- boot(firms, myfit, 200)})

On my computer, this returns:

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866   100
 rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759   100

So, direct indexing is more than 2 times faster at every level of the distribution.

A note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may arise. Commonly, fixed effects are used as controls, and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this common case, there is little or no harm in restricting the result returned by the myIndex or myfit function to the variables of interest.
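
A minimal sketch of that restriction (myIndexFE is a hypothetical variant of myIndex, with firm dummies included as controls and only the inv and capital coefficients returned):

myIndexFE <- function(x, i) {
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  fit <- lm(value ~ inv + capital + factor(firm), data = Grunfeld[blockObs, ])
  ## keep only the coefficients of interest; the dummy estimates are dropped
  coefficients(fit)[c("inv", "capital")]
}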



Answer 5:

I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

# First get the data
data("Grunfeld", package="plm")

firms <- unique(Grunfeld$firm)

myfit1 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
  coefficients(lm(value ~ inv + capital, data = mydata))
}

myfit2 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
  coefficients(lm(value ~ inv + capital, data = mydata))
}

# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
  ##  user  system elapsed 
  ## 13.51    0.01   13.62 

# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
   ## user  system elapsed 
   ## 8.16    0.02    8.26 

b1
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191

b2
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191