I have three cohorts of students identified by an ExperimentCohort factor. For each student, I have a LetterGrade, also a factor. I'd like to plot a histogram-like bar graph of LetterGrade for each ExperimentCohort. Using
ggplot(df, alpha = 0.2,
       aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort)) +
  geom_bar(position = "dodge")
gets me very close, but the three ExperimentCohorts don't have the same number of students. To compare these on a more even field, I'd like the y-axis to be the in-cohort proportion of each letter grade. So far, short of calculating this proportion and putting it in a separate data frame before plotting, I have not been able to find a way to do this.
Every solution to a similar question on SO and elsewhere involves aes(y = ..count../sum(..count..)), but sum(..count..) is evaluated across the whole data frame rather than within each cohort. Anyone got a suggestion? Here's code to create an example data frame:
df <- data.frame(ID = 1:60,
                 LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
                 ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE))
Thanks.
Wrong solution
You can use stat_bin() and y=..density.. to get percentages in each group.
ggplot(df, alpha = 0.2,
aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort))+
stat_bin(aes(y=..density..), position='dodge')
UPDATE - correct solution
As pointed out by @rpierce, y=..density.. calculates density values for each group, not percentages (the two are not the same).
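To make the difference concrete, here is a small base-R sketch. The vector x and the break points are made-up illustration values, not taken from the question. A density is the count divided by n times the bin width, so it only matches the proportion when the bin width is 1.

```r
# Made-up values, for illustration only
x <- c(1, 2, 2, 3, 7, 8, 9, 9)

# Two bins of width 5: [0, 5] and (5, 10]
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

proportion <- h$counts / sum(h$counts)  # share of observations per bin
width <- diff(h$breaks)                 # width of each bin

print(proportion)  # share of observations in each bin
print(h$density)   # count / (n * bin width), i.e. proportion / width
```

Here each bin holds half the observations, yet the density of each bin is 0.1, because the bins are 5 units wide.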
One way to get the correct percentages is to calculate them before plotting. For this I used the function ddply() from the plyr library. Within each ExperimentCohort, the proportions are calculated with prop.table() and table() and saved as prop. names() and table() are then used to get the LetterGrade labels back.
library(plyr)  # provides ddply(), .() and summarise
df.new<-ddply(df,.(ExperimentCohort),summarise,
              prop=prop.table(table(LetterGrade)),
              LetterGrade=names(table(LetterGrade)))
head(df.new)
ExperimentCohort prop LetterGrade
1 One 0.21739130 A
2 One 0.08695652 B
3 One 0.13043478 C
4 One 0.13043478 D
5 One 0.30434783 E
6 One 0.13043478 F
Now use this new data frame for plotting. As the proportions are already calculated, supply them as the y values and add stat="identity" inside geom_bar().
ggplot(df.new,aes(LetterGrade,prop,fill=ExperimentCohort))+
geom_bar(stat="identity",position='dodge')
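For comparison, the same within-cohort proportions can be sketched in base R, without plyr. This rebuilds the example data frame from the question with a fixed seed and explicit factor levels (both are my additions, for reproducibility), and df.base is a hypothetical name.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
grades  <- c("A", "B", "C", "D", "E", "F")
cohorts <- c("One", "Two", "Three")
df <- data.frame(
  ID = 1:60,
  LetterGrade = factor(sample(grades, 60, replace = TRUE), levels = grades),
  ExperimentCohort = factor(sample(cohorts, 60, replace = TRUE), levels = cohorts)
)

# Cross-tabulate cohort (rows) by grade (columns); margin = 1 divides each
# row by its own total, giving proportions within each cohort
tab <- prop.table(table(ExperimentCohort = df$ExperimentCohort,
                        LetterGrade = df$LetterGrade), margin = 1)

# Long format: one row per cohort/grade pair, proportion stored in `prop`
df.base <- as.data.frame(tab, responseName = "prop")
head(df.base)
```

df.base can then be passed to ggplot() with prop mapped to y and geom_bar(stat = "identity", position = "dodge"), in the same way as df.new.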
You can also do this by creating a weight column that sums to 1 for each group:
library(dplyr)  # provides %>%, group_by(), mutate() and n()
ggplot(df %>%
         group_by(ExperimentCohort) %>%
         mutate(weight = 1 / n()),
       aes(x = LetterGrade, fill = ExperimentCohort)) +
  geom_histogram(aes(weight = weight), stat = 'count', position = 'dodge')
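Why this works: every student in a cohort of size n carries a weight of 1/n, so the weighted count of a grade within a cohort equals its in-cohort proportion. Below is a minimal base-R check of that arithmetic, using ave() in place of the dplyr pipeline. The seed and the explicit factor levels are assumptions added for reproducibility.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
grades  <- c("A", "B", "C", "D", "E", "F")
cohorts <- c("One", "Two", "Three")
df <- data.frame(
  ID = 1:60,
  LetterGrade = factor(sample(grades, 60, replace = TRUE), levels = grades),
  ExperimentCohort = factor(sample(cohorts, 60, replace = TRUE), levels = cohorts)
)

# Weight of each student: 1 / size of that student's cohort
df$weight <- 1 / ave(df$ID, df$ExperimentCohort, FUN = length)

# Sum of the weights in each cohort/grade cell
weighted <- xtabs(weight ~ ExperimentCohort + LetterGrade, data = df)

# Within-cohort proportions computed directly, for comparison
direct <- prop.table(table(df$ExperimentCohort, df$LetterGrade), margin = 1)

# TRUE if the two approaches agree
print(all.equal(as.vector(weighted), as.vector(direct)))
```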
I recently attempted this and received an error when calling ddply: Column prop must be length 1 (a summary value), not 6. I spent some time with ddply but couldn't quite get the solution to work, so I offer an alternative (note that this uses dplyr rather than plyr):
library(dplyr)
df.new <- df %>%
  group_by(ExperimentCohort, LetterGrade) %>%
  summarise(n = n()) %>%
  # summarise() drops the last grouping level, so sum(n) is taken within each cohort
  mutate(freq = n / sum(n))
Then you can plot it just as @didzis-elferts mentioned:
ggplot(df.new,aes(LetterGrade,freq,fill=ExperimentCohort))+
geom_bar(stat="identity",position='dodge')
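For completeness, here is a sketch of a ggplot2-only route that avoids building a separate data frame. stat_count() exposes a computed variable prop, the proportion within each group, so mapping it to y with group = ExperimentCohort should give in-cohort proportions. This assumes ggplot2 3.3.0 or later for after_stat() (older versions spell it ..prop..); the seed is my addition for reproducibility.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only so the example is reproducible
df <- data.frame(
  ID = 1:60,
  LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
  ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE)
)

# prop is computed per group, so grouping by cohort makes the bars of
# each cohort sum to 1
p <- ggplot(df, aes(x = LetterGrade, fill = ExperimentCohort)) +
  geom_bar(aes(y = after_stat(prop), group = ExperimentCohort),
           position = "dodge") +
  ylab("Proportion within cohort")
print(p)
```

Without the group aesthetic set to the cohort, prop would be computed within each x/fill combination instead, so every bar would have height 1.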