I want to bootstrap a data set that has groups in it. A simple scenario would be bootstrapping simple means:
library(boot)
library(data.table)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
boot(data, stat, R = 10)
This gives me the error "incorrect number of subscripts on matrix" because of the by = "group" part. I managed to solve it using subsetting, but I don't like this solution. Is there a simpler way to make this kind of task work?
In particular, I'd like to introduce an additional argument in the statistic function, like stat(x, i, groupvar), and pass it to the boot function, like boot(data, stat(groupvar = group), R = 100).
This should do it:
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1
Using these package versions:
boot * 1.3-18 2016-02-23 CRAN (R 3.2.3)
data.table * 1.9.7 2015-10-05 Github (Rdatatable/data.table@d607425)
I received an error using the OP's code with the answer supplied by @eddi:
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1
Produces the error message:
Error in eval(expr, envir, enclos) : object 'group' not found
The error is fixed by removing by = "group" from the function stat:
set.seed(1000)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1
This produces the following bootstrap statistics, one boot object per group:
[[1]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = .SD, statistic = stat, R = 10)
Bootstrap Statistics :
original bias std. error
t1* 0.5158232 0.004930451 0.01576641
t2* 0.5240713 -0.001851889 0.02851483
[[2]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = .SD, statistic = stat, R = 10)
Bootstrap Statistics :
original bias std. error
t1* 0.5142383 -0.0072475030 0.02568692
t2* 0.5291694 -0.0001509404 0.02378447
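The printed list does not say which element belongs to which group. One way to keep track, sketched here under my own assumptions, is to keep the boot objects in a list column next to the group label and then pull out the pieces you need. The column name b and the summary_dt table are made up for illustration; only t0 (the original estimates) and t (the replicates) are read from each boot object.

```r
library(boot)
library(data.table)

set.seed(1000)
data <- data.table(x1 = runif(200), x2 = runif(200), group = runif(200) > 0.5)
stat <- function(x, i) x[i, c(m1 = mean(x1), m2 = mean(x2))]

# One boot object per group, stored in a list column beside its group label
res <- data[, .(b = list(boot(.SD, stat, R = 10))), by = group]

# Hypothetical summary: point estimate and bootstrap standard error per group
summary_dt <- res[, .(statistic = c("m1", "m2"),
                      original  = as.numeric(b[[1]]$t0),
                      std_error = apply(b[[1]]$t, 2, sd)),
                  by = group]
print(summary_dt)
```

Each group contributes two rows (m1 and m2), so the group column makes the mapping explicit.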
Below, I modify the sample dataset to highlight which bootstrap statistic goes with which group and column.
Consider group 1, which has a mean of about 10 for x1 and about 10000 for x2, and group 2, which has a mean of about 2000 for x1 and about 8000 for x2:
data2 <- as.data.table(list(x1 = c(runif(100, 9,11),runif(100, 1999,2001)), x2 = c(runif(100, 9999,10001),runif(100, 7999,8001)), group = rep(c(1,2), each=100)))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data2[, list(list(boot(.SD, stat, R = 10))), by = group]$V1
Which gives:
[[1]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = .SD, statistic = stat, R = 10)
Bootstrap Statistics :
original bias std. error
t1* 10.00907 0.007115938 0.04349184
t2* 9999.90176 -0.019569568 0.06160653
[[2]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = .SD, statistic = stat, R = 10)
Bootstrap Statistics :
original bias std. error
t1* 1999.965 0.031694179 0.06561209
t2* 8000.110 -0.006569872 0.03992401
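On the second part of the question (passing the grouping variable as an argument): boot() forwards extra named arguments to the statistic through its ... argument, so something like the following may work. This is only a sketch. The function name stat_by is made up, and I am assuming you want a single boot object whose statistic is the flattened vector of per-group means. The statistic has to return a plain numeric vector of constant length, so I use keyby to fix the group order and strata so that every resample keeps the same number of rows in each group.

```r
library(boot)
library(data.table)

set.seed(1000)
data <- data.table(x1 = runif(200), x2 = runif(200), group = runif(200) > 0.5)

# Hypothetical statistic: per-group means flattened into one numeric vector.
# groupvar is a column name given as a character string.
stat_by <- function(x, i, groupvar) {
  res <- x[i, .(m1 = mean(x1), m2 = mean(x2)), keyby = groupvar]
  unlist(res[, .(m1, m2)])  # m1 for each group, then m2 for each group
}

# groupvar is passed on to stat_by via boot()'s ... argument;
# strata resamples within each group so the group sizes stay fixed
b <- boot(data, stat_by, R = 10,
          strata = factor(data$group), groupvar = "group")
print(b)
```

With two groups this gives four statistics in one boot object, ordered as m1 for FALSE, m1 for TRUE, m2 for FALSE, m2 for TRUE.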
Lots of problems in your code before you even get to the by group part.
Did you mean something like this?
data <- as.data.frame(list(x1 = runif(200), x2 = runif(200), group = factor(sample(letters[1:2]))))
stat <- function(x, i) c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))
> stat(data, 1:10)
m1 m2
0.4465738 0.5522221
Then from there you can worry about doing it by group however you choose to.
For instance:
library(plyr)
dlply( data, .(group), function( dat ) boot(dat, stat, R=10) )
For bigger datasets, try data.table:
by( seq(nrow(data)), data$group, function(idx) myboot(data[idx,]))
I went with by() rather than data.table's by= argument because you want the output to be a list. There may be some functionality I don't know about for doing that, but I couldn't find it (see the edit history for the problem it was causing).
The subsetting is still done via data.table's [] method, so it should be plenty fast.
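Note that myboot() is not defined anywhere above. Assuming it is just a thin wrapper around boot() with the stat function from this answer, a self-contained version could look like the sketch below. I have swapped by() for split() plus lapply() here, so that idx is always a plain integer vector of row numbers; the result is still a list with one boot object per group.

```r
library(boot)
library(data.table)

set.seed(1)
data <- data.table(x1 = runif(200), x2 = runif(200),
                   group = factor(sample(letters[1:2], 200, replace = TRUE)))

stat <- function(x, i) c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))

# Assumed definition: the original answer does not show myboot()
myboot <- function(d) boot(d, stat, R = 10)

# Row indices per group, then one bootstrap per group; subsetting uses data.table's []
idx_by_group <- split(seq_len(nrow(data)), data$group)
res <- lapply(idx_by_group, function(idx) myboot(data[idx, ]))

names(res)  # list elements are named after the group levels
```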