I am trying to perform hierarchical bootstrapping to get some sample means from a large dataset with a nested data structure.
I have a dataset that is analogous to this:
ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
'3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- data.frame(cbind(ball, box, triangles))
df
--
ball box triangles
1 1 1
2 1 0
3 1 1
4 1 3
5 2 1
6 2 1
7 2 2
8 3 2
9 3 0
10 3 1
11 3 1
12 3 0
13 3 4
And the idea is that there are three boxes, each with a number of balls in them. Each ball has a number of triangles on it, such that it looks something like this:
My goal here is, using bootstrapping, to estimate the mean number of triangles on each ball, while controlling for which box the ball is in.
I want the simulation to sample with replacement 10,000 times from the boxes, each time randomly pulling a box, and then randomly samples the balls n times with replacement where n is the number of balls in the box (i.e. if box 1 is picked, then the simulation will randomly sample those four balls, four times, ending up with any number of responses, ex. ball 1, ball 1, ball 3, ball 4).
I want it to then calculate the mean of the number of triangles on the balls it sampled, store that value, and then sample a new box, thus repeating the process.
So far, I've tried to use an rsample method (described here: https://www.r-bloggers.com/bootstrapping-clustered-data/) like this:
#we need to sample groups aka boxes from
#the dataframe so use list-columns in
#tibbles
library(tidyverse)
library(tibble)
library(rsample)
Test <- df %>% nest(-box)
head(Test)
#now use bootstraps on this new tibble to
#sample by ID
set.seed(002)
testbs <- bootstraps(Test, times = 10)
testbs
#let's look at one of the bootstrap
#samples
as_tibble(testbs$splits[[1]]) %>% head()
#we can unnest the tibble and assess the
#averages by box
bs_avgtri<- map(testbs$splits,
~as_tibble(.) %>% unnest() %>%
group_by(box) %>%
summarize(mean_tri =
mean(triangles))) %>%
bind_rows(.id = 'boots')
bs_avgtri
However, I think this is flawed because of how I'm nesting the data. Also the outputs I'm getting don't make sense, often displaying multiple bootstrap levels. So I'm tending to think it's going wrong, but also I'm unsure how to really parse out what the different functions are doing.
I also know the approach I'm borrowing from isn't really meant for what I'm doing, I'm trying to jerry-rig a way of doing it and I don't think it's doing what I need it to do.
The only other way I can think to do this is to write a couple nested for loops, but I am not strong with for loops in R, and I am fairly sure there's a better way.
If anyone has any insight into this I would be very very grateful!!!!