library(ggplot2)
library(cowplot)  # provides background_grid()
d <- iris
ggplot(d, aes(factor(0), Sepal.Length)) +
  # violin with the 0.25/0.5/0.75 quantile lines drawn in red
  geom_violin(fill = "black", alpha = 0.2, draw_quantiles = c(0.25, 0.5, 0.75),
              colour = "red", size = 1.5) +
  # boxplot with whisker end caps, overlaid on the violin
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(width = 0.2) +
  facet_grid(. ~ Species, scales = "free_x") +
  xlab("") +
  ylab("Value") +
  coord_cartesian(ylim = c(3.5, 9.5)) +
  scale_y_continuous(breaks = seq(4, 9, 1)) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size = rel(1.5)),
        axis.ticks.x = element_blank(),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        legend.position = "none") +
  background_grid(major = "xy", minor = "none")
To my knowledge, the box ends in a boxplot represent the 25% and 75% quantiles, and the median is the 50% quantile. So they should be equal to the 0.25/0.5/0.75 quantiles that geom_violin draws via the draw_quantiles = c(0.25, 0.5, 0.75) argument.
The median and the 50% quantile match. However, neither the 0.25 nor the 0.75 quantile matches the box ends of the boxplot (see figure, especially the 'virginica' facet).
References:
http://docs.ggplot2.org/current/geom_violin.html
http://docs.ggplot2.org/current/geom_boxplot.html
This is too long for a comment, so I am posting it as an answer. I see two potential sources for the divergence. First, my understanding is that the boxplot refers to boxplot.stats, which uses hinges that are very similar, but not necessarily identical, to the quantiles. ?boxplot.stats says:
The two ‘hinges’ are versions of the first and third quartile, i.e.,
close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd
n (where n <- length(x)) and differ for even n. Whereas the quartiles
only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do
so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle
of two observations otherwise.
The hinge vs. quantile distinction could thus be one source of the difference.
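To make the hinge vs. quantile distinction concrete, here is a minimal sketch in base R. The toy vector x and the variable names are made up for illustration; the second part simply repeats the comparison per species on the iris data from the question. It only shows that the two definitions can disagree for even n, not which one a given plotting function actually uses.

```r
# Toy vector with an even number of observations (n = 8)
x <- 1:8

# Hinges as returned by boxplot.stats(): elements 2 and 4 of $stats
hinges <- boxplot.stats(x)$stats[c(2, 4)]

# Quartiles from quantile() with its default type = 7
quartiles <- unname(quantile(x, c(0.25, 0.75)))

print(hinges)     # 2.5 6.5
print(quartiles)  # 2.75 6.25

# Same comparison per species for the data used in the question
hinge_vs_quartile <- sapply(
  split(iris$Sepal.Length, iris$Species),
  function(v) {
    c(hinge    = boxplot.stats(v)$stats[c(2, 4)],
      quartile = unname(quantile(v, c(0.25, 0.75))))
  }
)
print(hinge_vs_quartile)
```

Rows hinge1/hinge2 and quartile1/quartile2 hold the lower and upper values for each species, so any mismatch shows up column by column.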
Second, geom_violin refers to a density estimate. The source code here points to a function StatYdensity, which leads me to here. I could not find the function compute_density, but I think (also based on some pointers in the help files) it is essentially density, which by default uses a Gaussian kernel to estimate the density. This may (or may not) explain the differences, but
by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))
do indeed show differing values. So I would guess that the difference comes down to whether we look at quantiles based on the empirical distribution function of the observations or on kernel density estimates, though I admit that I have not shown this conclusively.
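One caveat on the second call above: density(v)$x is the evenly spaced grid on which the density is evaluated, so its quantiles describe that grid rather than the estimated distribution. A rough sketch of quantiles that are actually derived from the kernel density estimate could look like the following. The helper kde_quantile is hypothetical (it is not a ggplot2 or base R function), and it assumes the default density() settings, which need not match the bandwidth or trimming that geom_violin applies.

```r
# Hypothetical helper: read quantiles off a kernel density estimate
kde_quantile <- function(v, probs = c(0.25, 0.5, 0.75)) {
  dens <- density(v)                    # Gaussian kernel by default
  cdf  <- cumsum(dens$y) / sum(dens$y)  # approximate CDF on the evaluation grid
  # Invert the approximate CDF by linear interpolation
  approx(x = cdf, y = dens$x, xout = probs, ties = "ordered")$y
}

by_species <- split(iris$Sepal.Length, iris$Species)

# Quantiles based on the kernel density estimate
kde_q <- sapply(by_species, kde_quantile)

# Quantiles based on the observations themselves
emp_q <- sapply(by_species, quantile, probs = c(0.25, 0.5, 0.75))

print(kde_q)
print(emp_q)
```

Comparing the two matrices side by side gives a direct check of whether smoothing alone moves the 0.25 and 0.75 quantiles away from the empirical ones.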