I would like to use ggplot2 to illustrate the difference between two similar density distributions. Here is a toy example of the type of data I have:
library(ggplot2)
# Make toy data
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)
# Standard density plot
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(D, aes(x = q, y = after_stat(density), col = event)) +
  geom_freqpoly()
Rather than separately plot the density for each category (dup and sp) as above, how could I plot a single line that shows the difference between these distributions?
In the toy example above, if I subtracted the dup density distribution from the sp density distribution, the resulting line would be above zero on the left side of the plot (since there is an abundance of smaller sp values) and below zero on the right (since there is an abundance of larger dup values). Note that there may be a different number of observations of type dup and sp.
More generally - what is the best way to show differences between similar density distributions?
There may be a way to do this within ggplot, but it is frequently easiest to do the calculations beforehand. In this case, call density() on each subset of q over the same range, then subtract the y values. Using dplyr (translate to base R or data.table if you wish),
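As one possible base R translation of these steps, here is a minimal sketch. It is not the answer's original code. It assumes the toy data from the question. The fixed seed, the grid size n = 512, and the names rng, d_sp, d_dup and dens_diff are illustrative choices only. Each group keeps its own default bandwidth, which you may prefer to set to a common value.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Toy data from the question
n_sp  <- 100000
n_dup <- 50000
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q     = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)

# Evaluate both densities on one shared grid so the y values line up
rng   <- range(D$q)
d_sp  <- density(D$q[D$event == "sp"],  from = rng[1], to = rng[2], n = 512)
d_dup <- density(D$q[D$event == "dup"], from = rng[1], to = rng[2], n = 512)

# Each density integrates to (about) 1, so unequal group sizes do not matter
dens_diff <- data.frame(x = d_sp$x, diff = d_sp$y - d_dup$y)

# Single line: sp density minus dup density, with a zero reference line
p <- ggplot(dens_diff, aes(x = x, y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line()
print(p)
```

Because both calls to density() share the same from, to and n, the two x grids are identical and the y values can be subtracted element by element.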