Creating a ggplot2 histogram with a cumulative dis

Posted 2019-07-27 15:38

Question:

Using ggplot2, I can create a histogram with a cumulative distribution curve with the following code. However, the stat_ecdf curve is scaled to the left y-axis.

library(ggplot2)
test.data <- data.frame(values = replicate(1, sample(0:10,1000, rep=TRUE)))
g <- ggplot(test.data, aes(x=values))
g + geom_bar() + 
    stat_ecdf() + 
    scale_y_continuous(sec.axis=sec_axis(trans = ~./100, name="percentage"))

Here is the graph generated (you can see the ecdf at the bottom):

How do I scale the stat_ecdf to the second y-axis?

Answer 1:

In general, you want to multiply the internally calculated ECDF value (the cumulative proportion, which runs from 0 to 1 and is called ..y..) by the inverse of the secondary-axis transformation, so that its vertical extent is similar to that of the bars:

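library(ggplot2)
library(scales)  # provides percent() for the secondary-axis labels

set.seed(2)
test.data <- data.frame(values = replicate(1, sample(0:10, 1000, replace=TRUE)))

ggplot(test.data, aes(x=values)) +
  geom_bar(fill="grey70") + 
  # ..y.. is the computed ECDF (0 to 1); multiplying by 100 puts it on the count scale
  stat_ecdf(aes(y=..y..*100)) + 
  # the secondary axis divides by 100 again, so it reads as a proportion
  scale_y_continuous(sec.axis=sec_axis(trans = ~./100 , name="percentage", labels=percent)) +
  theme_bw()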
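
The arithmetic behind the factor of 100 can be checked outside ggplot2. The following is only a sketch: it uses base R's ecdf() as a stand-in for what stat_ecdf computes, and the names scale_factor, ecdf_scaled and recovered are hypothetical, introduced here for illustration.

```r
set.seed(2)
values <- sample(0:10, 1000, replace = TRUE)

# Empirical CDF evaluated at each possible value; results are proportions up to 1
ecdf_vals <- ecdf(values)(0:10)

# Bar heights as geom_bar() would draw them: one count per value
counts <- table(factor(values, levels = 0:10))

# Hypothetical name: the factor that maps the ECDF onto the count axis
scale_factor <- 100
ecdf_scaled <- ecdf_vals * scale_factor   # now in primary-axis (count) units

# The secondary axis applies the inverse (~ . / 100), recovering the proportions
recovered <- ecdf_scaled / scale_factor

range(ecdf_scaled)  # vertical extent of the rescaled curve
max(counts)         # height of the tallest bar, for comparison
```

Comparing the two printed values shows whether the curve and the bars occupy a similar vertical range. If they do not, a different scale_factor is needed.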

Because you distributed 1,000 values randomly among 11 buckets, each bar holds roughly 90 values, so a hand-picked factor of 100 happens to line up both y-scales on round numbers. Below is a more general version.

In addition, it would be nice to be able to programmatically determine the transformation factor, so that we don't have to pick it by hand after seeing the bar heights in the plot. To do that, we calculate the height of the highest bar outside ggplot and use that value (called max_y below) in the plot. We also use the pretty function to reset max_y to the highest break value on the y-axis associated with the highest bar (ggplot uses pretty to set the default axis breaks), so that the primary and secondary y-axis breaks will line up.

Finally, we use aes_ and bquote to create a quoted call, so that ggplot will recognize the passed max_y value.

set.seed(2)
test.data <- data.frame(values = replicate(1, sample(0:10, 768, replace=TRUE)))

# height of the tallest bar
max_y <- max(table(test.data$values))
# round up to the highest default axis break so both axes line up
max_y <- max(pretty(c(0, max_y)))

ggplot(test.data, aes(x=values)) +
  geom_bar(fill="grey70") + 
  # bquote() substitutes the numeric value of max_y into the quoted expression
  stat_ecdf(aes_(y=bquote(..y.. * .(max_y)))) + 
  scale_y_continuous(sec.axis=sec_axis(trans = ~./max_y, name="percentage", labels=percent)) +
  theme_bw()
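
Recent ggplot2 releases deprecate both the ..y.. notation and aes_() in favour of after_stat(). As a hedged sketch (not the answer's original code, and assuming a ggplot2 version that provides after_stat()), the same plot could be written as follows. The transformation formula is passed to sec_axis() positionally because the name of that argument differs between ggplot2 versions.

```r
library(ggplot2)

set.seed(2)
test.data <- data.frame(values = sample(0:10, 768, replace = TRUE))

# Top of the primary axis: highest "pretty" break covering the tallest bar
max_y <- max(pretty(c(0, max(table(test.data$values)))))

p <- ggplot(test.data, aes(x = values)) +
  geom_bar(fill = "grey70") +
  # after_stat(y) is the computed ECDF value; max_y is looked up in the
  # environment where aes() was called, so no bquote() is needed
  stat_ecdf(aes(y = after_stat(y) * max_y)) +
  scale_y_continuous(
    # inverse transformation: counts back to proportions
    sec.axis = sec_axis(~ . / max_y, name = "percentage",
                        labels = scales::percent)
  ) +
  theme_bw()

p
```

With this scaling the ECDF reaches max_y on the primary axis exactly where the secondary axis shows 100%.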



Tags: r ggplot2