I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind()
stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
- Each firm can potentially be selected multiple times, so I need to include it data multiple times in each iteration's data set.
- Using a loop and subset approach, like below, seems computationally burdensome.
- Note that for my real data frame, n, and the number iterations is much larger than the example below.
My thoughts initially are to break the existing data frame into a list by subject using the split()
command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf
from the plyr
package to construct a new data frame.
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
The solution needs to be modified to manage fixed effects.
You can also use the tsboot function in the
boot
package with fixed block resampling scheme.Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except
boot
). The key here is to usewhich
and integer indexing to construct each data.frame replicate rather thansplit/subset
anddo.call/rbind
.now, bootstrap
Run the accepted answer
Let's eyeball a comparison
using indexing
Original version
These look pretty close. Now, a bit more checking
and
Finally, we'll do a quick benchmark
On my computer, this returns
So, direct indexing is more than 2 times faster at every level of the distribution.
note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this dominant case, there is no (or very little) harm in restricting the returned result of the
myIndex
ormyfit
function to only include the variables of interest in the returned vector.I found a method using
dplyr::left_join
that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.