I have two-level hierarchical data and I'm trying to perform non-parametric bootstrap sampling on the highest level, i.e., randomly sampling the highest-level clusters with replacement while keeping the original within-cluster data.
I want to achieve this using the boot() function in the {boot} package, for the reason that I then would like to build BCa confidence intervals using boot.ci() which requires a boot object.
Here follows my unlucky attempt - running a debug on the boot call showed that random sampling is not happening at cluster level (=subject).
### create a very simple two-level dataset with 'subject' as clustering variable
rho <- 0.4
dat <- expand.grid(
trial=factor(1:5),
subject=factor(1:3)
)
sig <- rho * tcrossprod(model.matrix(~ 0 + subject, dat))
diag(sig) <- 1
set.seed(17); dat$value <- chol(sig) %*% rnorm(15, 0, 1)
### my statistic function (adapted from here: http://biostat.mc.vanderbilt.edu/wiki/Main/HowToBootstrapCorrelatedData)
resamp.mean <- function(data, i){
cluster <- c('subject', 'trial')
# sample the clustering factor
cls <- unique(data[[cluster[1]]])[i]
# subset on the sampled clustering factors
sub <- lapply(cls, function(b) subset(data, data[[cluster[1]]]==b))
sub.2 <- do.call(rbind, sub) # join and return samples
mean((sub.2$value)) # calculate the statistic
}
debugonce(boot)
set.seed(17); dat.boot <- boot(data = dat, statistic = resamp.mean, 4)
### stepping trough the debugger until object 'i' was assigned
### investigating 'i'
# Browse[2]> head(i)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 3 7 12 13 10 14 14 15 12 12 12 4 5 9 10
[2,] 15 9 3 13 4 10 2 4 6 11 10 4 9 4 3
[3,] 8 4 7 15 10 12 9 8 9 12 4 15 14 10 4
[4,] 12 3 1 15 8 13 9 1 4 13 9 13 2 11 2
### which is not what I was hoping for.
### I would like something that looks like this, supposing indices = c(2, 2, 1) for the first resample:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 6 7 8 9 10 6 7 8 9 10 1 2 3 4 5
Any help would be very much appreciated.
I think the problem originates from the modified statistic function (specifically, the
cls
object within the function). Can you try this one? Uncomment theprint
statement to see which subjects have been sampled. It does not use theindex
argument whichboot
expects, instead it just usessample
as in the original function.One resample from the
resamp.mean
function before the mean ofvalue
is calculated looks like this: