I have some grouped data with one row per item.
I want to do stratified sampling by group, with two restrictions: (1) a certain total sample size; (2) samples should be partitioned as evenly as possible among groups (i.e. minimal sd of the group sample sizes).
Ideally, we pick the same (fixed) number of items from each group, which is no problem when the group size is >= the desired size for all groups. However, sometimes the group size is less than size. The total number of items is always above the total sample size though. For example, with a total sample size of 12 and four distinct groups, we ideally want to pick 3 items from each group:
size_tot <- 12
n_grp <- 4
size <- size_tot / n_grp
Some data:
library(data.table)
d2 <- data.table(id = 1:16,
                 grp = rep(c("a", "b", "c", "d"), c(9, 4, 2, 1)))
d2
# id grp
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 a
# 5: 5 a
# 6: 6 a
# 7: 7 a
# 8: 8 a
# 9: 9 a
# 10: 10 b
# 11: 11 b
# 12: 12 b
# 13: 13 b
# 14: 14 c
# 15: 15 c
# 16: 16 d
My original logic was "if the number of items is equal to or larger than size, sample size items from the group, else just pick all items from the group". See also here, here and here.
set.seed(1)
d2[ , if(.N >= size) .SD[sample(x = .N, size = size)] else .SD, by = "grp"]
# grp id
# 1: a 3
# 2: a 9
# 3: a 5
# 4: b 13
# 5: b 10
# 6: b 11
# 7: c 14
# 8: c 15
# 9: d 16
In the two groups with a sufficient number of items (a and b), we sampled 3 items from each. In the small groups (c and d), we just picked all there were, i.e. 2 and 1 respectively. This results in a total sample size of 9, i.e. less than the desired total size of 12. Thus, we need to sample additional items from the larger groups that have a surplus of items to reach the desired total sample size. In this case, the desired sampling would be one additional item from "b" and two additional items from "a".
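The top-up described above can be computed directly from the group sizes. What follows is only a minimal sketch: `allocate_even` is a hypothetical helper name, and the rule it uses (visit the groups from smallest to largest and give each one the floor of an even share of what is still unallocated, capped at the group's size) is an assumption about how "as evenly as possible" might be put into practice.

```r
# Hypothetical helper (name and approach are assumptions, not from a package).
# n: named vector of group sizes; size_tot: desired total sample size.
allocate_even <- function(n, size_tot) {
  stopifnot(sum(n) >= size_tot)
  alloc <- setNames(integer(length(n)), names(n))
  remaining <- size_tot   # items still to allocate
  k <- length(n)          # groups still to visit
  for (i in order(n)) {   # smallest groups first
    share <- remaining %/% k          # even share of what is left
    alloc[i] <- min(n[i], share)      # never ask for more than the group has
    remaining <- remaining - alloc[i]
    k <- k - 1
  }
  alloc
}

n <- c(a = 9, b = 4, c = 2, d = 1)
alloc <- allocate_even(n, size_tot = 12)
alloc
# a b c d
# 5 4 2 1
```

With per-group sizes like these, the sampling step could plausibly be written as `d2[ , .SD[sample(.N, alloc[[grp]])], by = grp]`, which looks up each group's size by name.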
Here's how I thought of partitions with the lowest sd. The total sample size can be partitioned into four groups like this:
library(partitions)
cmp <- compositions(n = size_tot, m = 4)
The partitions can then be ordered from low sd (equal sample size among groups - desired) to high sd:
std <- apply(cmp, 2, sd)
cmp2 <- cmp[ , order(std)]
cmp2[ , 1:10]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 3 4 3 3 4 3 4 2 3 2
# [2,] 3 3 4 3 3 4 2 4 2 3
# [3,] 3 3 3 4 2 2 3 3 4 4
# [4,] 3 2 2 2 3 3 3 3 3 3
And the group sizes:
d2[ , .(n = .N), by = "grp"]
# grp n
# 1: a 9
# 2: b 4
# 3: c 2
# 4: d 1
But how do I match this partition (which sums to 12) against the group sample sizes (which do not necessarily sum to 12)? Does anyone else smell an XY problem here? If so, are there alternative approaches which I have overlooked?
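One possible way to do the matching is to discard every allocation that asks a group for more items than it has, and then take the lowest-sd allocation among those that remain. The sketch below does this by brute force in base R with `expand.grid` rather than by filtering `cmp2`, purely so that it is self-contained. The variable names are illustrative, and enumerating every candidate only scales to a handful of small groups.

```r
n <- c(a = 9, b = 4, c = 2, d = 1)   # group sizes from the example
size_tot <- 12

# Every allocation taking 0..n[i] items from group i (feasible by construction)
cand <- expand.grid(lapply(n, function(x) 0:x))
# Keep only the allocations that hit the desired total
cand <- cand[rowSums(cand) == size_tot, , drop = FALSE]
# Pick the most even one (smallest sd across groups)
best <- unlist(cand[which.min(apply(cand, 1, sd)), ])
best
# a b c d
# 5 4 2 1
```

The same idea applied to the ordered compositions would be to keep only the columns in which every entry is less than or equal to the corresponding group size, and take the first one that survives.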
PS: I have considered proportional allocation (proportionate sampling), but when the distribution of group sizes is sufficiently skewed, such sampling obviously does not respect the absolute total sample size and does not distribute samples evenly among groups (e.g. caret::createDataPartition and strata::balancedstratification).