Block bootstrap from subject list

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.

I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.

Each firm can potentially be selected multiple times, so I need to include it data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real data frame, n, and the number iterations is much larger than the example below.

My thoughts initially are to break the existing data frame into a list by subject using the split() command. From there, use

sample(unique(df1$subject),n,replace=TRUE)

to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.

Example slow code:

require(plm)
data("Grunfeld", package="plm")

firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()

for(j in 1:iterations){

  v = sample(length(firms),n,replace=TRUE)
  newdata = NULL

  for(i in 1:n){
    newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
  }

  reg1 = lm(value ~ inv + capital, data = newdata)
  mybootresults[[j]] = coefficients(reg1)

}

mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults

  (Intercept)      inv    capital
1    373.8591 6.981309 -0.9801547
2    370.6743 6.633642 -1.4526338
3    528.8436 6.960226 -1.1597901
4    331.6979 6.239426 -1.0349230
5    507.7339 8.924227 -2.8661479
...
...

标签： r regression plyr statistics-bootstrap

5条回答

对你真心纯属浪费

2楼-- · 2020-02-13 10:08

How about something like this:

myfit <- function(x, i) {
   mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
   coefficients(lm(value ~ inv + capital, data = mydata))
}

firms <- unique(Grunfeld$firm)

b0 <- boot(firms, myfit, 999)

0人赞添加讨论(0) 举报

Summer. ? 凉城

3楼-- · 2020-02-13 10:10

The solution needs to be modified to manage fixed effects.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")

## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)

## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
    ## join the sampled firms to their firm-year data
    mydata <- left_join(X[i,], Grunfeld, by="firm")
    ## Distinguish between multiple resamples of the same firm
    ## Otherwise they have the same id in the fixed effects regression
    ## And trouble ensues
    mydata  <- mutate(group_by(mydata,firm,year),
                      firm_uniq4boot = paste(firm,"+",row_number())
                      )
    ## Run regression with and without firm fixed effects
    c(coefficients(lm(value ~ inv + capital, data = mydata)),
    coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
    }

set.seed(1)
system.time(b <- boot(firms, myfit, 1000))

summary(b)

summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))

0人赞添加讨论(0) 举报

Luminary・发光体

4楼-- · 2020-02-13 10:23

You can also use the tsboot function in the boot package with fixed block resampling scheme.

require(plm)
require(boot)
data(Grunfeld)

### each firm is of length 20
table(Grunfeld$firm)
##  1  2  3  4  5  6  7  8  9 10 
## 20 20 20 20 20 20 20 20 20 20


blockboot <- function(data) 
{
 coefficients(lm(value ~ inv + capital, data = data))

}

### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")

boot.1    
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 410.81557 -25.785972    174.3766
## t2*   5.75981   0.451810      2.0261
## t3*  -0.61527   0.065322      0.6330

dim(boot.1$t)
## [1] 99  3

head(boot.1$t)
##        [,1]   [,2]      [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283  0.031324
## [3,] 479.74 3.2531  0.637298
## [4,] 557.79 4.5284  0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860

0人赞添加讨论(0) 举报

ら.Afraid

5楼-- · 2020-02-13 10:24

Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.

# get function for boot
myIndex <- function(x, i) {
  # select the observations to subset. Likely repeated observations
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # run regression for given replicate, return estimated coefficients
  coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}

now, bootstrap

# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)

Run the accepted answer

set.seed(1234)
b0 <- boot(firms, myfit, 200)

Let's eyeball a comparison

using indexing

b1

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myIndex, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

Original version

b0

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myfit, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

These look pretty close. Now, a bit more checking

identical(b0$t, b1$t)
[1] TRUE

and

identical(summary(b0), summary(b1))
[1] TRUE

Finally, we'll do a quick benchmark

library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)}, 
              rbind={b0 <- boot(firms, myfit, 200)})

On my computer, this returns

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866   100
 rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759   100

So, direct indexing is more than 2 times faster at every level of the distribution.

note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this dominant case, there is no (or very little) harm in restricting the returned result of the myIndex or myfit function to only include the variables of interest in the returned vector.

0人赞添加讨论(0) 举报

乱世女痞

6楼-- · 2020-02-13 10:24

I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

# First get the data
data("Grunfeld", package="plm")

firms <- unique(Grunfeld$firm)

myfit1 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
  coefficients(lm(value ~ inv + capital, data = mydata))
}

myfit2 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
  coefficients(lm(value ~ inv + capital, data = mydata))
}

# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
  ##  user  system elapsed 
  ## 13.51    0.01   13.62 

# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
   ## user  system elapsed 
   ## 8.16    0.02    8.26 

b1
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191

b2
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191

0人赞添加讨论(0) 举报

Block bootstrap from subject list

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间