Related to this question.
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)
mydf[ sample( which(mydf$gender=='F'), 3 ), ]
Instead of selecting a fixed number of rows (3 in the above case), how can I randomly select 20% of the rows with "F"? That is, of the five rows with "F", how do I randomly sample 20% of them?
To sample 20%, you can use this to get the sample size:
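A minimal sketch of that calculation, assuming the fraction is rounded to the nearest whole number with round(). The names frac, n_f and n_sample are illustrative and not from the original answer:

```r
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

frac <- 0.2                       # the 20% fraction (assumed name)
n_f <- sum(mydf$gender == "F")    # number of rows with "F" (5 in this data)
n_sample <- round(frac * n_f)     # 20% of 5 rows is 1 row
```

With very small groups the rounded size can be 0 (for example, 20% of two rows), so wrapping it in max(1, ...) may be worth considering if at least one row is always wanted.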
How about this:
Where 0.2 is your 20% and length(which(mydf$gender=='F')) is the total number of rows with "F".
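Put together, the full call could look something like the sketch below. It assumes the product is rounded with round() so that sample() receives a whole number; the names f_rows, n_sample and sampled are illustrative:

```r
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

f_rows <- which(mydf$gender == "F")            # row indices of the "F" rows
n_sample <- round(0.2 * length(f_rows))        # 20% of those rows, as a whole number
sampled <- mydf[sample(f_rows, n_sample), ]    # randomly pick that many "F" rows
```

One caveat: if only a single row matches, sample() treats that lone index x as the range 1:x, so that case may need separate handling.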
Self-promotion alert. I wrote a function that allows convenient stratified sampling, and I've included an option to subset levels from the grouping variables before sampling.
The function is called stratified and can be used in the following ways:
You can specify multiple groups (for example, if your data frame included a "state" variable and you wanted to group by "state" and "gender", you would specify group = c("state", "gender")). You can also specify multiple "select" arguments (for example, if you wanted only female respondents from California and Texas, and your "state" variable used two-letter state abbreviations, you could specify select = list(gender = "F", state = c("CA", "TX"))).
The function itself can be found here, or you can download and install the package (which gives you convenient access to the help pages and examples) by using install_github
from the "devtools" package as follows:
You can use the sample_frac() function in the dplyr package.
For example, if you want to sample 20% within each group:
If you want to sample 20% within each gender group:
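A sketch of how this might look with dplyr, assuming the package is installed. The filter() step is added here as one way to restrict the sample to the "F" rows from the question, and the names by_gender and only_f are illustrative. Note that sample_frac() has been superseded by slice_sample(prop = ...) in newer dplyr releases, though it is still available:

```r
library(dplyr)

gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

# Sample 20% of the rows within each gender group
by_gender <- mydf %>%
  group_by(gender) %>%
  sample_frac(0.2)

# Sample 20% of the "F" rows only (assumed variant for the original question)
only_f <- mydf %>%
  filter(gender == "F") %>%
  sample_frac(0.2)
```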