Sampling a specific age distribution from a dataset

Posted 2019-05-31 11:07

Question:

Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.

How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want the sample of 1,000 people to be distributed like this:

0.3 * Age 0 - 30

0.3 * Age 31 - 50

0.2 * Age 51 - 69

0.2 * Age 70 - 100

Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?

As an example, this is how I have created the population distribution of Maine:

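set.seed(123)
library(magrittr)  # provides the %>% pipe

# Age brackets for Maine with the share of the population in each
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Pick a bracket for each of the 1000 people, weighted by prop
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)

# Draw a uniform age within each chosen bracket
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()

names(Maine) <- c("Age")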

Now I don't know how to bring this together with my other dataset, which has the whole US population. I'd appreciate any help; I have been stuck on this for quite a while now.

Answer 1:

Below are four different approaches. Two use functions from, respectively, the splitstackshape and sampling packages, one uses base mapply, and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).

First let's set up some fake data and sampling parameters:

# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))

# Add a grouping variable for age range
# (to match the question's 0-30 bin exactly, the second break would be 31 rather than 30)
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)

# Total number of people sampled
n = 1000

# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))

Using the above sampling parameters, we want to sample n rows in total, with each age group contributing its share probs of them, i.e. probs*n rows per group.
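The options below pass probs*n as the per-group sample size, which works cleanly here because each product is a whole number. When that is not the case (say, three equal groups and n = 10), the sizes need to be rounded so that they still add up to n. A minimal sketch of one way to do that, using largest-remainder rounding; stratum_sizes is a hypothetical helper name, not a function from any of the packages used in this answer:

```r
# Hypothetical helper: turn proportions into whole-number stratum sizes that sum to n
stratum_sizes <- function(probs, n) {
  raw   <- probs / sum(probs) * n   # ideal (possibly fractional) sizes
  sizes <- floor(raw + 1e-9)        # whole part, tolerant of floating-point error
  short <- n - sum(sizes)           # rows still to hand out
  if (short > 0) {
    # Give one extra row to the groups with the largest fractional remainders
    extra <- order(raw - sizes, decreasing = TRUE)[seq_len(short)]
    sizes[extra] <- sizes[extra] + 1
  }
  setNames(as.integer(sizes), names(probs))
}

# The proportions from this answer, and a case that does not divide evenly
stratum_sizes(c(a = 0.3, b = 0.3, c = 0.2, d = 0.2), 1000)
stratum_sizes(c(1, 1, 1) / 3, 10)
```

The result can be used wherever probs*n appears in the options that follow.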

Option 1: mapply

mapply can apply multiple arguments to a function. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:

df.sample = mapply(a=split(df, df$age.groups), b=probs*n, 
       function(a,b) {
         a[sample(1:nrow(a), b), ]
       }, SIMPLIFY=FALSE)

mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:

df.sample = do.call(rbind, df.sample)

Check the sampling:

table(df.sample$age.groups)
[0,30)  [30,51)  [51,70) [70,Inf) 
   300      300      200      200
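The same split-and-sample idea can be applied to the Maine age brackets from the question. The sketch below is only an illustration under stated assumptions: us is a small made-up stand-in for the full US data set, maine.sample is a hypothetical name, and the lower bracket edges in popMaine$min are reused as cut points. The Maine shares as given sum to 1.02, so they are rescaled before being turned into row counts.

```r
set.seed(123)

# Stand-in for the full US data set (assumption: one row per person, ages 0-100)
us <- data.frame(age = sample(0:100, 1e5, replace = TRUE))

# Age brackets and target shares from the question's Maine example
popMaine <- data.frame(min  = c(0, 19, 26, 35, 55, 65),
                       max  = c(18, 25, 34, 54, 64, 113),
                       prop = c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Lower bracket edges plus an open upper edge; right = FALSE gives [0,19), [19,26), ...
us$age.groups <- cut(us$age, c(popMaine$min, Inf), right = FALSE)

# Rows wanted per bracket: shares rescaled to sum to 1, then rounded
sizes <- round(popMaine$prop / sum(popMaine$prop) * 1000)

# Sample each bracket separately without replacement, then stack the pieces
pieces <- mapply(function(a, b) a[sample(nrow(a), b), , drop = FALSE],
                 split(us, us$age.groups), sizes, SIMPLIFY = FALSE)
maine.sample <- do.call(rbind, pieces)

# Compare the achieved counts with the targets
table(maine.sample$age.groups)
```

Because the rows come straight from us, every other column in the real data (race, gender) would be carried along with the sampled ages.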

Option 2: stratified function from the splitstackshape package

The size argument requires a named vector with the number of samples from each stratum.

library(splitstackshape)

df.sample2 = stratified(df, "age.groups", size=probs*n)

Option 3: strata function from the sampling package

This option is by far the slowest.

library(sampling)

# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]

sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")

df.sample3 = df[sampled.rows$ID_unit, ] 

Option 4: tidyverse packages

map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.

library(dplyr)
library(purrr)

df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows

Timings

library(microbenchmark)
Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10  a 
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10  a 
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10  a
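The table lists timings, but the call that produced it is not shown. A comparison of this kind could be set up roughly as follows. This is a sketch, not the original benchmark: it assumes the microbenchmark, dplyr and purrr packages are installed, uses a smaller fake data set so it runs quickly, and times only two of the four options (the others can be added as further named arguments).

```r
library(microbenchmark)  # assumed to be installed
library(dplyr)
library(purrr)

set.seed(156)

# Smaller fake data than in the answer (assumption, to keep the run short)
df <- data.frame(age = sample(0:100, 1e5, replace = TRUE))
df$age.groups <- cut(df$age, c(0, 30, 51, 70, Inf), right = FALSE)

# Rows wanted per age group
sizes <- c(300, 300, 200, 200)

# Each named argument is one expression to time; times = 10 matches neval in the table
bench <- microbenchmark(
  mapply = do.call(rbind, mapply(function(a, b) a[sample(nrow(a), b), ],
                                 split(df, df$age.groups), sizes,
                                 SIMPLIFY = FALSE)),
  tidyverse = map2(split(df, df$age.groups), sizes, sample_n) %>% bind_rows(),
  times = 10
)
print(bench)
```

Absolute numbers will differ by machine and data size, so only the relative ordering is worth comparing against the table.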