Hierarchical/Nested Bootstrapping Means

2019-08-27 03:04发布

问题:

I am trying to perform hierarchical bootstrapping to get some sample means from a large dataset with a nested data structure.

I have a dataset that is analogous to this:

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
     '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- data.frame(cbind(ball, box, triangles))
df
--
ball box triangles
   1   1         1
   2   1         0
   3   1         1
   4   1         3
   5   2         1
   6   2         1
   7   2         2
   8   3         2
   9   3         0
  10   3         1
  11   3         1
  12   3         0
  13   3         4

And the idea is that there are three boxes, each with a number of balls in them. Each ball has a number of triangles on it, such that it looks something like this:

My goal here is, using bootstrapping, to estimate the mean number of triangles on each ball, while controlling for which box the ball is in.

I want the simulation to sample with replacement 10,000 times from the boxes, each time randomly pulling a box, and then randomly samples the balls n times with replacement where n is the number of balls in the box (i.e. if box 1 is picked, then the simulation will randomly sample those four balls, four times, ending up with any number of responses, ex. ball 1, ball 1, ball 3, ball 4).

I want it to then calculate the mean of the number of triangles on the balls it sampled, store that value, and then sample a new box, thus repeating the process.

So far, I've tried to use an rsample method (described here: https://www.r-bloggers.com/bootstrapping-clustered-data/) like this:

#we need to sample groups aka boxes from 
#the dataframe so use list-columns in 
#tibbles
library(tidyverse)
library(tibble)
library(rsample)

Test <- df %>% nest(-box)
head(Test)

#now use bootstraps on this new tibble to 
#sample by ID
set.seed(002)
testbs <- bootstraps(Test, times = 10)
testbs

#let's look at one of the bootstrap 
#samples
as_tibble(testbs$splits[[1]]) %>% head()

#we can unnest the tibble and assess the 
#averages by box 
bs_avgtri<- map(testbs$splits, 
      ~as_tibble(.) %>% unnest() %>% 
                   group_by(box) %>% 
                   summarize(mean_tri = 
                   mean(triangles))) %>% 
                  bind_rows(.id = 'boots')
bs_avgtri

However, I think this is flawed because of how I'm nesting the data. Also the outputs I'm getting don't make sense, often displaying multiple bootstrap levels. So I'm tending to think it's going wrong, but also I'm unsure how to really parse out what the different functions are doing.

I also know the approach I'm borrowing from isn't really meant for what I'm doing, I'm trying to jerry-rig a way of doing it and I don't think it's doing what I need it to do.

The only other way I can think to do this is to write a couple nested for loops, but I am not strong with for loops in R, and I am fairly sure there's a better way.

If anyone has any insight into this I would be very very grateful!!!!

回答1:

tidyr::crossing is very handy for simulations.

library("tidyverse")

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- tibble(ball, box, triangles)

df %>%
  # How many times do you want to run the simulation?
  crossing(rep = seq(3)) %>%
  # Next describe the sampling.
  # For each simulation and for each box...
  group_by(rep, box) %>%
  # randomly sample n() balls with replacement,
  # where n() is the number of balls in the box.
  sample_n(n(), ball, replace = TRUE) %>%
  # Compute the mean number of triangles (for each replicate, for each box)
  summarise(triangles = mean(triangles))
#> # A tibble: 9 x 3
#> # Groups:   rep [3]
#>     rep box   triangles
#>   <int> <chr>     <dbl>
#> 1     1 1          1.5 
#> 2     1 2          1.67
#> 3     1 3          2   
#> 4     2 1          2   
#> 5     2 2          1.33
#> 6     2 3          1.33
#> 7     3 1          2   
#> 8     3 2          1.67
#> 9     3 3          1.5

Created on 2019-03-04 by the reprex package (v0.2.1)



回答2:

I don't know much about rsample.

But according to your description, I think the base function sample is enough.

I wrote a simple version to achieve the mean value (based on my understanding). See if that's what you want.

set.seed(100)

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)

names(ball) = box
names(triangles) = ball

sample_balls = function(input_ball){
  chosen_box = sample(names(input_ball), 1, replace = T)
  chosen_balls = ball[which(names(input_ball) == chosen_box)]
  sampled_balls = sample(chosen_balls, length(chosen_balls), replace = T)
  return(sampled_balls)
}

nTriangles = unlist(lapply(1:100, function(x){
  nTriangle = triangles[sample_balls(ball)]
}))

mean(nTriangles)
#> [1] 1.331237