ggplot2 - Multi-group histogram with in-group prop

2020-01-30 07:51发布

问题:

I have three cohorts of students identified by an ExperimentCohort factor. For each student, I have a LetterGrade, also a factor. I'd like to plot a histogram-like bar graph of LetterGrade for each ExperimentCohort. Using

ggplot(df, alpha = 0.2, 
       aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort))                                                                                                                                                       
  + geom_bar(position = "dodge")

gets me very close, but the three ExperimentCohorts don't have the same number of students. To compare these on a more even field, I'd like the y-axis to be the in-cohort proportion of each letter-grade. So far, short of calculating this proportion and putting it in a separate dataframe before plotting, I have not been able to find a way to do this.

Every solution to a similar question on SO and elsewhere involves aes(y = ..count../sum(..count..)), but sum(..count..) is executed across the whole dataframe rather than within each cohort. Anyone got a suggestion? Here's code to create an example dataframe:

df <- data.frame(ID = 1:60, 
        LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = T),
        ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = T))

Thanks.

回答1:

Wrong solution

You can use stat_bin() and y=..density.. to get percentages in each group.

ggplot(df, alpha = 0.2,
      aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort))+
      stat_bin(aes(y=..density..), position='dodge')

UPDATE - correct solution

As pointed out by @rpierce y=..density.. will calculate density values for each group not the percentages (they are not the same).

To get the correct solution with percentages one way is to calculate them before plotting. For this used function ddply() from library plyr. In each ExperimentCohort calculated proportions using functions prop.table() and table() and saved them as prop. With names() and table() got back LetterGrade.

df.new<-ddply(df,.(ExperimentCohort),summarise,
              prop=prop.table(table(LetterGrade)),
              LetterGrade=names(table(LetterGrade)))

 head(df.new)
  ExperimentCohort       prop LetterGrade
1              One 0.21739130           A
2              One 0.08695652           B
3              One 0.13043478           C
4              One 0.13043478           D
5              One 0.30434783           E
6              One 0.13043478           F

Now use this new data frame for plotting. As proportions are already calculated - provided them as y values and added stat="identity" inside the geom_bar.

ggplot(df.new,aes(LetterGrade,prop,fill=ExperimentCohort))+
  geom_bar(stat="identity",position='dodge')



回答2:

You can also do this by creating a weight column that sums to 1 for each group:

ggplot(df %>%
         group_by(ExperimentCohort) %>%
         mutate(weight = 1 / n()),
       aes(x = LetterGrade, fill = ExperimentCohort)) +
  geom_histogram(aes(weight = weight), stat = 'count', position = 'dodge')


回答3:

I recently attempted this and received an error calling ddply: Column prop must be length 1 (a summary value), not 6. Spent some time with ddply but couldn't quite get the solution to work so I offer up an alternative (note this still makes use of plyr):

df.new <- df2 %>% 
    group_by(ExperimentCohort,LetterGrade) %>% 
    summarise (n = n()) %>%
    mutate(freq = n / sum(n))

Then you can plot it just as @didzis-elferts mentioned:

ggplot(df.new,aes(LetterGrade,freq,fill=ExperimentCohort))+
    geom_bar(stat="identity",position='dodge')


标签: r ggplot2