Calculate relative frequency for a certain group

Posted 2020-04-14 08:35

Question:

I have a data.frame of categorical variables that I have divided into groups and I got the counts for each group.

My original data nyD looks like:

Source: local data frame [7 x 3]
Groups: v1, v2, v3

  v1    v2   v3
1  a  plus  yes
2  a  plus  yes
3  a minus   no
4  b minus  yes
5  b     x  yes
6  c     x notk
7  c     x notk

I performed the following operations using dplyr:

ny1 <- nyD %>% group_by(v1,v2,v3)%>%
           summarise(count=n()) %>%
           mutate(prop = count/sum(count))


My data "ny1" looks like:

Source: local data frame [5 x 5]
Groups: v1, v2

  v1    v2   v3 count prop
1  a minus   no     1    1
2  a  plus  yes     2    1
3  b minus  yes     1    1
4  b     x  yes     1    1
5  c     x notk     2    1

I want the prop variable to hold the relative frequency within each v1 group, that is, the corresponding count divided by the sum of counts for that v1 group. The v1 groups have totals of 3 "a", 2 "b" and 2 "c". So ny1$prop[1] should be 1/3, ny1$prop[2] should be 2/3, and so on. The mutate step using count/sum(count) is not correct: I need to specify that the sum should be taken within the v1 group only. Is there a way to use dplyr to achieve this?

Answer 1:

You can do this whole thing in one step, starting from your original data nyD and without creating ny1. By default, summarise drops the last grouping level (v2), so when you run mutate afterwards the data is grouped only by v1 (certainly my favorite feature in dplyr) and sum(count) is computed within each v1 group.

nyD %>% 
   group_by(v1, v2) %>%
   summarise(count = n()) %>%
   mutate(prop = count/sum(count))

# Source: local data frame [5 x 4]
# Groups: v1
# 
#   v1    v2 count      prop
# 1  a minus     1 0.3333333
# 2  a  plus     2 0.6666667
# 3  b minus     1 0.5000000
# 4  b     x     1 0.5000000
# 5  c     x     2 1.0000000
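
The level-dropping behaviour can be checked directly. The sketch below rebuilds nyD from the table in the question (the data.frame() call is a reconstruction, not the asker's code) and states the default explicitly with .groups = "drop_last", an argument that, as far as I know, needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Reconstruction of nyD from the table in the question (assumed, not original code)
nyD <- data.frame(
  v1 = c("a", "a", "a", "b", "b", "c", "c"),
  v2 = c("plus", "plus", "minus", "minus", "x", "x", "x"),
  v3 = c("yes", "yes", "no", "yes", "yes", "notk", "notk"),
  stringsAsFactors = FALSE
)

res <- nyD %>%
  group_by(v1, v2) %>%
  # "drop_last" is what summarise() does by default here; stating it
  # makes the intent visible and should suppress the grouping message
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(prop = count / sum(count))

# Grouping variables still attached after summarise()
group_vars(res)
```

If group_vars(res) reports only v1, the mutate step is summing counts within each v1 group, which is what the question asks for.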

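Or a shorter version using count (thanks to @beginneR). In recent dplyr versions count() returns an ungrouped result when its input is ungrouped, so group by v1 explicitly before mutate:

nyD %>% 
  count(v1, v2) %>% 
  group_by(v1) %>%
  mutate(prop = n/sum(n))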
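
If v3 should stay in the result, as in the ny1 table from the question, one option is to regroup by v1 explicitly before computing the proportion. This is a sketch rather than part of the original answer, and the data.frame() call is reconstructed from the table in the question.

```r
library(dplyr)

# Reconstruction of nyD from the table in the question (assumed, not original code)
nyD <- data.frame(
  v1 = c("a", "a", "a", "b", "b", "c", "c"),
  v2 = c("plus", "plus", "minus", "minus", "x", "x", "x"),
  v3 = c("yes", "yes", "no", "yes", "yes", "notk", "notk"),
  stringsAsFactors = FALSE
)

ny1 <- nyD %>%
  group_by(v1, v2, v3) %>%
  summarise(count = n()) %>%           # one row per v1/v2/v3 combination
  group_by(v1) %>%                     # replace the leftover grouping with v1 only
  mutate(prop = count / sum(count)) %>%
  ungroup()                            # return a plain, ungrouped table

ny1
```

Calling group_by(v1) a second time overrides whatever grouping summarise left behind, so the result does not depend on which level was dropped.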


Tags: r dplyr