Stratified sampling with restrictions: fixed total

Posted 2019-05-05 06:14

Question:

I have some grouped data with one row per item. I want to do stratified sampling by group, with two restrictions: (1) a fixed total sample size; (2) samples should be partitioned as evenly as possible among groups (i.e. minimal sd of the group sample sizes).

Ideally, we pick the same (fixed) number of items from each group, which is no problem when the group size is >= the desired size for all groups. However, sometimes a group's size is less than size. The total number of items is always above the total sample size, though. For example, with a total sample size of 12 and four distinct groups, we ideally want to pick 3 items from each group:

size_tot <- 12
n_grp <- 4
size <- size_tot / n_grp

Some data:

library(data.table)
d2 <- data.table(id = 1:16,
                 grp = rep(c("a", "b", "c", "d"), c(9, 4, 2, 1)))
d2
#     id grp
#  1:  1   a
#  2:  2   a
#  3:  3   a
#  4:  4   a
#  5:  5   a
#  6:  6   a
#  7:  7   a
#  8:  8   a
#  9:  9   a
# 10: 10   b
# 11: 11   b
# 12: 12   b
# 13: 13   b
# 14: 14   c
# 15: 15   c
# 16: 16   d

My original logic was: "if the number of items in a group is greater than or equal to size, sample size items from the group; otherwise just pick all items from the group".

set.seed(1)
d2[ , if(.N >= size) .SD[sample(x = .N, size = size)] else .SD, by = "grp"]

#    grp id
# 1:   a  3
# 2:   a  9
# 3:   a  5
# 4:   b 13
# 5:   b 10
# 6:   b 11
# 7:   c 14
# 8:   c 15
# 9:   d 16

In the two groups with a sufficient number of items (a and b), we sampled 3 items from each. In the small groups (c and d), we just picked all there were, i.e. 2 and 1 respectively. This results in a total sample size of 9, i.e. less than the desired total size of 12. Thus, we need to sample additional items from the larger groups, which have a surplus of items, to reach the desired total sample size. In this case, the desired sampling would be one additional item from "b" and two additional items from "a".
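The arithmetic in the paragraph above can be checked with a few lines of base R. This is only an illustrative sketch: the names `grp_n`, `first_pass`, `shortfall` and `surplus` are not from the original post, and the group sizes are typed in by hand rather than taken from `d2`.

```r
# Group sizes from the example data (entered by hand for this sketch)
grp_n <- c(a = 9, b = 4, c = 2, d = 1)
size_tot <- 12
size <- size_tot / length(grp_n)        # ideal per-group size: 3

# First pass: take `size` items where possible, otherwise the whole group
first_pass <- pmin(grp_n, size)         # a = 3, b = 3, c = 2, d = 1

# Items still missing from the desired total
shortfall <- size_tot - sum(first_pass) # 3

# Items left over in each group after the first pass
surplus <- grp_n - first_pass           # a = 6, b = 1, c = 0, d = 0
```

Only "a" and "b" have a surplus, and "b" can give up just one more item, so the three missing items have to come as one from "b" and two from "a".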

Here's how I thought of partitions with lowest sd. The total sample size can be partitioned into four groups like this:

library(partitions)
cmp <- compositions(n = size_tot, m = 4)

The partitions can then be ordered from low sd (equal sample size among groups - desired) to high sd:

std <- apply(cmp, 2, sd)
cmp2 <- cmp[ , order(std)]

cmp2[ , 1:10]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    3    4    3    3    4    3    4    2    3     2
# [2,]    3    3    4    3    3    4    2    4    2     3
# [3,]    3    3    3    4    2    2    3    3    4     4
# [4,]    3    2    2    2    3    3    3    3    3     3
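To make the ordering criterion concrete, here is a small base-R illustration of the sd for three candidate allocations of 12 items over four groups. The variable names are made up for this sketch; the third allocation is the one described earlier as desired for the example data.

```r
# Perfectly even allocation: no spread at all
sd_even <- sd(c(3, 3, 3, 3))      # 0

# One item moved from one group to another (column 2 of cmp2 above)
sd_near <- sd(c(4, 3, 3, 2))      # sqrt(2/3), about 0.816

# The allocation the small groups force on us in this example
sd_forced <- sd(c(5, 4, 2, 1))    # sqrt(10/3), about 1.826
```

Because the total and the number of groups are fixed, ordering by sd is the same as ordering by the sum of squared group sample sizes.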

And the group sizes:

d2[ , .(n = .N), by = "grp"]
#    grp n
# 1:   a 9
# 2:   b 4
# 3:   c 2
# 4:   d 1

But how do I match these partitions (which sum to 12) against the group sizes (which do not necessarily sum to 12)? Does anyone else smell an XY problem here? In other words, are there alternative approaches that I have overlooked?


PS: I have considered proportional allocation (proportionate sampling), but when the distribution of group sizes is sufficiently skewed, such sampling obviously neither respects the absolute total sample size nor distributes samples evenly among groups (e.g. caret::createDataPartition and strata::balancedstratification).
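A rough illustration of that point, assuming the simplest form of proportional allocation (each group's share of the total, rounded to the nearest integer). This is not what the packages mentioned above do internally; `prop_alloc` and `grp_n` are names invented for this sketch.

```r
# Group sizes from the example data (entered by hand for this sketch)
grp_n <- c(a = 9, b = 4, c = 2, d = 1)
size_tot <- 12

# Naive proportional allocation: share of the total, then rounded
prop_alloc <- round(size_tot * grp_n / sum(grp_n))

prop_alloc       # a = 7, b = 3, c = 2, d = 1
sum(prop_alloc)  # 13, not the requested 12
```

With these skewed group sizes the rounded shares overshoot the requested total by one, and the allocation is far from even (7 versus 1).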

Answer 1:

I think your approach is almost there. You just need to filter cmp2 to get the first set of sampling sizes in which every sampling size is less than or equal to the corresponding group size:

#Create a set of indices of sampling sizes that fit the criteria
original_groups <- d2[, .N, by = grp][,N]
valid_indexes <- apply(cmp2, 2, function(x) all(x <= original_groups))

#Take the first of these valid indices (lowest variance)
sampling_sizes <- cmp2[,which(valid_indexes)[1]]

#Create a sampling size variable on the datatable
d2[, sampling_size := rep(sampling_sizes, original_groups)]

#Sample as before; sampling_size is constant within a group, so use its first value
d2[ , .SD[sample(x = .N, size = sampling_size[1])], by = "grp"]
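If enumerating every composition becomes expensive for larger totals or more groups, one possible alternative is a greedy allocation that hands out one item at a time to the least-served group that still has items left. The sketch below uses base R only; `allocate_even` is a hypothetical helper written for this illustration, not a function from partitions or data.table, and the data are rebuilt as a plain data.frame so the block stands alone.

```r
# Hypothetical helper: spread `total` over groups as evenly as possible
# without exceeding any group's size.
allocate_even <- function(group_sizes, total) {
  stopifnot(total <= sum(group_sizes))
  alloc <- integer(length(group_sizes))
  for (i in seq_len(total)) {
    open <- which(alloc < group_sizes)    # groups with items left
    pick <- open[which.min(alloc[open])]  # least-served open group
    alloc[pick] <- alloc[pick] + 1L
  }
  names(alloc) <- names(group_sizes)
  alloc
}

# Same example data as in the question, as a base-R data.frame
d <- data.frame(id = 1:16,
                grp = rep(c("a", "b", "c", "d"), c(9, 4, 2, 1)))
grp_n <- c(table(d$grp))                  # a = 9, b = 4, c = 2, d = 1

alloc <- allocate_even(grp_n, 12)         # a = 5, b = 4, c = 2, d = 1

# Sample the allocated number of rows from each group
set.seed(1)
picked <- do.call(rbind, lapply(names(alloc), function(g) {
  rows <- d[d$grp == g, ]
  rows[sample.int(nrow(rows), alloc[[g]]), ]
}))
```

For the example data this gives the same sampling sizes as described in the question (5, 4, 2 and 1), and `picked` has 12 rows.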