How do I use dplyr to create proportions of a level of a factor variable for each state? For example, I'd like to add a variable that indicates the percent of females within each state to the data frame.
# gen data
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
Here's an attempt that I know is wrong, but gets me access to the information:
middle %>%
group_by(state, gender %in%c("Female")) %>%
summarise(count = n()) %>%
mutate(test_count = count)
I have a hard time with the count and mutate functions, which makes it hard to get much further. It doesn't behave as I'd expect.
To add a new column to your existing data frame:
school.data %>%
group_by(state) %>%
mutate(pct.female = mean(gender == "Female"))
Use summarize
rather than mutate
if you just want one row per state rather than adding a column to the original data.
school.data %>%
group_by(state) %>%
summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
# state pct.female
# <fctr> <dbl>
# 1 Idaho 0.75
# 2 Maine 0.70
Gregor's answer gets to the heart of it. Here's a version that would give you counts and proportions for both genders per state:
library(dplyr)
gender.proportions <- group_by(school.data, state, gender) %>%
summarize(n = length(student.id)) %>% # count per gender
ungroup %>% group_by(state) %>%
mutate(proportion = n / sum(n)) # proportion per gender
# state gender n proportion
# <fctr> <fctr> <int> <dbl>
#1 Idaho Female 16 0.80
#2 Idaho Male 4 0.20
#3 Maine Female 11 0.55
#4 Maine Male 9 0.45
Edit:
In reference to OP's comment/request, the code below would repeat the male and female proportions for each individual student in each state:
gender.proportions <- group_by(school.data, state) %>%
mutate(prop.female = mean(gender == 'Female'), prop.male = mean(gender == 'Male'))
student.id state gender prop.female prop.male
<int> <fctr> <fctr> <dbl> <dbl>
1 479 Idaho Male 0.8 0.2
2 634 Idaho Female 0.8 0.2
3 175 Idaho Female 0.8 0.2
4 527 Idaho Female 0.8 0.2
5 368 Idaho Female 0.8 0.2
6 423 Idaho Male 0.8 0.2
7 357 Idaho Female 0.8 0.2
8 994 Idaho Female 0.8 0.2
9 479 Idaho Female 0.8 0.2
10 634 Idaho Female 0.8 0.2
# ... with 30 more rows
Here is one solution using a left_join
.
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
school.data %>%
group_by(state) %>%
mutate(gender_id = ifelse(gender == "Female", 1, 0)) %>%
summarise(female_count = sum(gender_id)) %>%
left_join(school.data %>%
group_by(state) %>%
summarise(state_count = n()),
by = c("state" = "state")
) %>%
mutate(percent_female = female_count / state_count)