dplyr to create aggregate percentages of factor le

2020-03-26 13:19发布

问题:

How do I use dplyr to create proportions of a level of a factor variable for each state? For example, I'd like to add a variable that indicates the percent of females within each state to the data frame.

# gen data
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )  
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)

Here's an attempt that I know is wrong, but gets me access to the information:

 middle %>%
   group_by(state, gender %in%c("Female")) %>%
   summarise(count = n()) %>%
   mutate(test_count = count)

I have a hard time with the count and mutate functions, which makes it hard to get much further. It doesn't behave as I'd expect.

回答1:

To add a new column to your existing data frame:

school.data %>% 
    group_by(state) %>%
    mutate(pct.female = mean(gender == "Female"))

Use summarize rather than mutate if you just want one row per state rather than adding a column to the original data.

school.data %>%
   group_by(state) %>%
   summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
#    state pct.female
#   <fctr>      <dbl>
# 1  Idaho       0.75
# 2  Maine       0.70


回答2:

Gregor's answer gets to the heart of it. Here's a version that would give you counts and proportions for both genders per state:

library(dplyr)

gender.proportions <- group_by(school.data, state, gender) %>% 
  summarize(n = length(student.id)) %>% # count per gender
  ungroup %>% group_by(state) %>% 
  mutate(proportion = n / sum(n)) # proportion per gender

#   state gender     n proportion
#  <fctr> <fctr> <int>      <dbl>
#1  Idaho Female    16       0.80  
#2  Idaho   Male     4       0.20
#3  Maine Female    11       0.55
#4  Maine   Male     9       0.45

Edit:

In reference to OP's comment/request, the code below would repeat the male and female proportions for each individual student in each state:

gender.proportions <- group_by(school.data, state) %>% 
  mutate(prop.female = mean(gender == 'Female'), prop.male = mean(gender == 'Male'))

   student.id  state gender prop.female prop.male
        <int> <fctr> <fctr>       <dbl>     <dbl>
1         479  Idaho   Male         0.8       0.2
2         634  Idaho Female         0.8       0.2
3         175  Idaho Female         0.8       0.2
4         527  Idaho Female         0.8       0.2
5         368  Idaho Female         0.8       0.2
6         423  Idaho   Male         0.8       0.2
7         357  Idaho Female         0.8       0.2
8         994  Idaho Female         0.8       0.2
9         479  Idaho Female         0.8       0.2
10        634  Idaho Female         0.8       0.2
# ... with 30 more rows


回答3:

Here is one solution using a left_join.

state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )  
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)

school.data %>%
    group_by(state) %>%
    mutate(gender_id = ifelse(gender == "Female", 1, 0)) %>%
    summarise(female_count = sum(gender_id)) %>%

    left_join(school.data %>%
                  group_by(state) %>%
                  summarise(state_count = n()),

              by = c("state" = "state")
    ) %>%
    mutate(percent_female = female_count / state_count)


标签: r dplyr