Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.
How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want this dataset with 1,000 people distributed like this:
0.3 * Age 0 - 30
0.3 * Age 31 - 50
0.2 * Age 51 - 69
0.2 * Age 70 - 100
Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?
As an example, this is how I have created the population distribution of Maine:
set.seed(123)
library(magrittr)
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()
names(Maine) <- c("Age")
Now I don't know how to bring this together with my other dataset, which has the whole US population. I'd appreciate any help; I have been stuck for quite a while now.
Below are four different approaches. Two use functions from, respectively, the `splitstackshape` and `sampling` packages, one uses base `mapply`, and one uses `map2` from the `purrr` package (which is part of the `tidyverse` collection of packages).
First let's set up some fake data and sampling parameters:
Using the above sampling parameters, we want to sample `n` total values, with a proportion `probs` from each age group.
Option 1: `mapply`
`mapply` can apply multiple arguments to a function. Here, the arguments are (1) the data frame `df` split into the four age groupings, and (2) `probs*n`, which gives the number of rows we want from each age group.
`mapply` returns a list of four data frames, one for each stratum. Combine this list into a single data frame, then check the sampling.
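The three steps of the `mapply` approach (sample per group, combine, check) might look roughly like this. It is a sketch rather than the original code: the fake data, the `age.groups` column and the anonymous sampling function are assumptions.

```r
set.seed(123)

# Hypothetical fake data (smaller than the real thing to keep it quick)
df <- data.frame(
  age    = sample(0:100, 1e5, replace = TRUE),
  gender = sample(c("F", "M"), 1e5, replace = TRUE)
)
df$age.groups <- cut(df$age, breaks = c(0, 30, 50, 69, 100),
                     include.lowest = TRUE,
                     labels = c("0-30", "31-50", "51-69", "70-100"))
n <- 1000
probs <- c(0.3, 0.3, 0.2, 0.2)

# split() returns one data frame per factor level, in level order,
# so the i-th group lines up with the i-th element of probs
groups <- split(df, df$age.groups)

# Draw round(probs * n) rows without replacement from each group
df.sample <- mapply(
  function(d, size) d[sample(nrow(d), size), , drop = FALSE],
  groups, round(probs * n),
  SIMPLIFY = FALSE
)

# Combine the list of four data frames into one
df.sample <- do.call(rbind, df.sample)

# Check the sampling: counts and shares per age group
print(table(df.sample$age.groups))
print(prop.table(table(df.sample$age.groups)))
```

`SIMPLIFY = FALSE` matters here: without it `mapply` tries to collapse the result into a matrix instead of returning a list of data frames.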
Option 2: `stratified` function from the `splitstackshape` package
The `size` argument requires a named vector with the number of samples from each stratum.
Option 3: `strata` function from the `sampling` package
This option is by far the slowest.
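For Options 2 and 3, calls along the following lines should work, assuming the packages are installed. This is a sketch under assumptions, not the original code: the fake data and the `age.groups` column are made up, and each call is skipped if its package is missing. For `strata`, the data is sorted by stratum first so that the order of `size` matches the order in which the strata appear.

```r
set.seed(123)

# Hypothetical fake data, kept small because strata() is slow
df <- data.frame(
  age    = sample(0:100, 2e4, replace = TRUE),
  gender = sample(c("F", "M"), 2e4, replace = TRUE)
)
df$age.groups <- cut(df$age, breaks = c(0, 30, 50, 69, 100),
                     include.lowest = TRUE,
                     labels = c("0-30", "31-50", "51-69", "70-100"))
n <- 1000
probs <- c(0.3, 0.3, 0.2, 0.2)

# Named vector: number of rows wanted from each stratum
size <- setNames(round(probs * n), levels(df$age.groups))

# Option 2: splitstackshape::stratified matches sizes to strata by name
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  s2 <- splitstackshape::stratified(df, "age.groups", size = size)
  print(table(s2$age.groups))
}

# Option 3: sampling::strata takes sizes in the order the strata occur in the
# data, so sort by stratum first; getdata() then extracts the selected rows
if (requireNamespace("sampling", quietly = TRUE)) {
  df.sorted <- df[order(df$age.groups), ]
  st <- sampling::strata(df.sorted, stratanames = "age.groups",
                         size = unname(size), method = "srswor")
  s3 <- sampling::getdata(df.sorted, st)
  print(table(s3$age.groups))
}
```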
Option 4: `tidyverse` packages
`map2` is like `mapply` in that it applies two arguments in parallel to a function, in this case the `dplyr` package's `sample_n` function. `map2` returns a list of four data frames, one for each stratum, which we combine into a single data frame with `bind_rows`.
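A minimal sketch of the `map2` version, assuming `purrr` and `dplyr` are installed (the block is skipped otherwise). The fake data and the `age.groups` column are assumptions, as in the rest of the illustration.

```r
set.seed(123)

# Hypothetical fake data
df <- data.frame(
  age    = sample(0:100, 1e5, replace = TRUE),
  gender = sample(c("F", "M"), 1e5, replace = TRUE)
)
df$age.groups <- cut(df$age, breaks = c(0, 30, 50, 69, 100),
                     include.lowest = TRUE,
                     labels = c("0-30", "31-50", "51-69", "70-100"))
n <- 1000
probs <- c(0.3, 0.3, 0.2, 0.2)

if (requireNamespace("purrr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # map2 pairs each per-group data frame with its sample size
  # and calls sample_n(data, size) on each pair
  df.sample <- purrr::map2(split(df, df$age.groups),
                           round(probs * n),
                           dplyr::sample_n)

  # Stack the four per-stratum data frames into one
  df.sample <- dplyr::bind_rows(df.sample)

  print(table(df.sample$age.groups))
}
```

`sample_n` still works but is marked as superseded in current `dplyr`; `slice_sample(n = ...)` is the suggested replacement if you prefer the newer interface.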
Timings
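The timing results are not reproduced here. To compare the approaches on your own machine, a rough sketch with base R's `system.time` could look like this. The fake data, the repetition count of 50 and the helper names `sample_base` and `sample_tidy` are assumptions for illustration; the tidyverse timing runs only if `purrr` and `dplyr` are installed.

```r
set.seed(123)

# Hypothetical fake data
df <- data.frame(age = sample(0:100, 1e5, replace = TRUE))
df$age.groups <- cut(df$age, breaks = c(0, 30, 50, 69, 100),
                     include.lowest = TRUE,
                     labels = c("0-30", "31-50", "51-69", "70-100"))
n <- 1000
probs <- c(0.3, 0.3, 0.2, 0.2)

# Option 1 (base mapply) wrapped in a helper so it can be timed
sample_base <- function() {
  s <- mapply(function(d, k) d[sample(nrow(d), k), , drop = FALSE],
              split(df, df$age.groups), round(probs * n),
              SIMPLIFY = FALSE)
  do.call(rbind, s)
}

# Time 50 repetitions of the base approach
time_base <- system.time(for (i in 1:50) sample_base())
print(time_base)

# Option 4 (map2 + sample_n), timed only if the packages are available
if (requireNamespace("purrr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  sample_tidy <- function() {
    dplyr::bind_rows(purrr::map2(split(df, df$age.groups),
                                 round(probs * n), dplyr::sample_n))
  }
  time_tidy <- system.time(for (i in 1:50) sample_tidy())
  print(time_tidy)
}
```

The `splitstackshape` and `sampling` calls can be wrapped in helpers and timed in the same way.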