Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.
How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want this dataset of 1,000 people to be distributed like this:
0.3 * Age 0 - 30
0.3 * Age 31 - 50
0.2 * Age 51 - 69
0.2 * Age 70 - 100
Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?
As an example, this is how I have created the population distribution of Maine:
set.seed(123)
library(magrittr)
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()
names(Maine) <- c("Age")
Now I don't know how to combine this with my other dataset, which covers the whole US population. I'd appreciate any help; I have been stuck on this for quite a while now.
Below are four different approaches. Two use functions from the splitstackshape and sampling packages, respectively; one uses base mapply; and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).
First let's set up some fake data and sampling parameters:
# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
# Total number of people sampled
n = 1000
# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
Using the above sampling parameters, we want to sample n total values, drawing the proportion given in probs from each age group.
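As a quick sanity check on these parameters, the sketch below computes the per-group counts implied by probs and n and confirms that they add up to n. The round() call and the small ages test vector are my additions, not part of the setup above; rounding is there because probs*n is not guaranteed to be a whole number for every choice of proportions. The second half shows, assuming the question's bins are meant literally (0-30, 31-50, 51-69, 70+), breaks that would put age 30 in the first group; the breaks used in the setup place 30 in the second group.

```r
# Sampling parameters as in the setup (names written out by hand here)
n = 1000
probs = c("[0,30)"=0.3, "[30,51)"=0.3, "[51,70)"=0.2, "[70,Inf)"=0.2)

# Rows to draw from each group; round() guards against non-integer products
counts = round(probs * n)
stopifnot(sum(counts) == n)
counts

# Assumption: breaks that match the question's bins exactly
ages = c(0, 30, 31, 50, 51, 69, 70, 100)   # hypothetical boundary ages
bins = cut(ages, c(0, 31, 51, 70, Inf), right=FALSE)
table(bins)
```

If the rounded counts ever fail to sum to n, one of them has to be adjusted by hand before sampling.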
Option 1: mapply
mapply can apply multiple arguments to a function in parallel. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:
df.sample = mapply(a=split(df, df$age.groups), b=probs*n,
                   function(a,b) {
                     a[sample(1:nrow(a), b), ]
                   }, SIMPLIFY=FALSE)
mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:
df.sample = do.call(rbind, df.sample)
Check the sampling:
table(df.sample$age.groups)
  [0,30)  [30,51)  [51,70) [70,Inf)
     300      300      200      200
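To connect this back to the Maine table in the question, here is a rough sketch of the same split-and-sample idea using popMaine's brackets instead of the four groups above. The us data frame is a made-up stand-in for the real US data, and the names us, n.maine, counts, pieces and maine.sample are mine. Note that the proportions in the question add up to 1.02, so I rescale them to sum to 1 before turning them into counts; that rescaling is an assumption about what is intended.

```r
set.seed(123)

# Age table from the question (Maine)
popMaine = data.frame(min=c(0, 19, 26, 35, 55, 65),
                      max=c(18, 25, 34, 54, 64, 113),
                      prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Hypothetical stand-in for the US data (the real data also has race and gender)
us = data.frame(age=sample(0:100, 1e5, replace=TRUE))

# Group ages using the lower bound of each Maine bracket as a break point
us$age.groups = cut(us$age, c(popMaine$min, Inf), right=FALSE)

# Rows to draw per bracket: rescale the proportions, then round to whole numbers
n.maine = 1000
counts = round(n.maine * popMaine$prop / sum(popMaine$prop))

# Sample each bracket separately, then stack the pieces
pieces = mapply(function(a, b) a[sample(nrow(a), b), , drop=FALSE],
                split(us, us$age.groups), counts, SIMPLIFY=FALSE)
maine.sample = do.call(rbind, pieces)

table(maine.sample$age.groups)
```

Because whole rows are sampled from the stand-in data, any other columns (race, gender) would come along with the sampled ages, which is the "combining" step the question asks about.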
Option 2: stratified function from the splitstackshape package
The size argument requires a named vector with the number of samples from each stratum.
library(splitstackshape)
df.sample2 = stratified(df, "age.groups", size=probs*n)
Option 3: strata function from the sampling package
This option is by far the slowest.
library(sampling)
# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]
sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")
df.sample3 = df[sampled.rows$ID_unit, ]
Option 4: tidyverse packages
map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.
library(dplyr)
library(purrr)
df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows
Timings
library(microbenchmark)
Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10  a
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10  a
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10  a
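If you want a rough timing of your own without installing anything, base R's system.time can stand in for microbenchmark. The sketch below is not the benchmark that produced the table above; it times only the mapply approach, on a smaller fake data set, and the wrapper name sample.mapply is made up for illustration.

```r
set.seed(156)

# Smaller fake data than the 1e6 rows used above, to keep the run short
df = data.frame(age=sample(0:100, 1e5, replace=TRUE))
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
n = 1000
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

# Hypothetical wrapper around the mapply approach
sample.mapply = function() {
  parts = mapply(function(a, b) a[sample(nrow(a), b), ],
                 split(df, df$age.groups), round(probs*n), SIMPLIFY=FALSE)
  do.call(rbind, parts)
}

# Time 10 repetitions; "elapsed" is wall-clock time in seconds
timing = system.time(for (i in 1:10) res <- sample.mapply())
timing["elapsed"]
```

Absolute numbers will differ by machine and data size, so only compare timings taken in the same session.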