Differing quantiles: Boxplot vs. Violinplot

Posted 2019-05-01 19:21

Question:

library(ggplot2)
library(cowplot)  # provides background_grid()
d <- iris

ggplot(d, aes(factor(0), Sepal.Length)) +
    # violin with lines drawn at the 0.25, 0.5 and 0.75 quantiles
    geom_violin(fill = "black", alpha = 0.2, draw_quantiles = c(0.25, 0.5, 0.75),
                colour = "red", size = 1.5) +
    # boxplot with whisker end caps, overlaid for comparison
    stat_boxplot(geom = "errorbar", width = 0.1) +
    geom_boxplot(width = 0.2) +
    facet_grid(. ~ Species, scales = "free_x") +
    xlab("") +
    ylab("Value") +
    coord_cartesian(ylim = c(3.5, 9.5)) +
    scale_y_continuous(breaks = seq(4, 9, 1)) +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_text(size = rel(1.5)),
          axis.ticks.x = element_blank(),
          strip.background = element_rect(fill = "black"),
          strip.text = element_text(color = "white", face = "bold"),
          legend.position = "none") +
    background_grid(major = "xy", minor = "none")

To my knowledge, the ends of the box in a boxplot mark the 25% and 75% quantiles, and the line inside the box marks the median (the 50% quantile). They should therefore coincide with the 0.25/0.5/0.75 quantiles that geom_violin draws when given draw_quantiles = c(0.25, 0.5, 0.75).

The median and the 50% quantile do coincide. However, the 0.25 and 0.75 quantiles drawn on the violin do not line up with the box ends of the boxplot (see figure, especially the 'virginica' facet).

References:

  1. http://docs.ggplot2.org/current/geom_violin.html

  2. http://docs.ggplot2.org/current/geom_boxplot.html

Answer 1:

This is too long for a comment, so I am posting it as an answer. I see two potential sources for the divergence. First, my understanding is that the boxplot relies on boxplot.stats, which uses hinges that are very similar, but not necessarily identical, to the quartiles. ?boxplot.stats says:

The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.

The hinge vs quantile distinction could thus be one source for the difference.
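
To make the quoted help text concrete, here is a small sketch comparing Tukey's hinges with the default quantiles. It uses fivenum(), which boxplot.stats() calls internally, and quantile() with its default type = 7. The toy vector x and the names hinges, quartiles and cmp are only illustrative.

```r
# Toy example with even n (n %% 4 == 2), where hinges fall on observations
x <- 1:10
hinges    <- fivenum(x)[c(2, 4)]                 # lower and upper hinge
quartiles <- unname(quantile(x, c(0.25, 0.75)))  # default type = 7 quantiles
print(rbind(hinges, quartiles))

# Same comparison per species for the iris data used in the question
cmp <- sapply(split(iris$Sepal.Length, iris$Species), function(v) {
  c(hinge_lower = fivenum(v)[2], q25 = unname(quantile(v, 0.25)),
    hinge_upper = fivenum(v)[4], q75 = unname(quantile(v, 0.75)))
})
print(cmp)
```

Each species has n = 50, so any gap between hinge and quartile here is at most the distance between two neighbouring observations. Note also that the geom_boxplot documentation describes its hinges as the first and third quartiles and says this differs slightly from boxplot(), so this comparison illustrates the help text above rather than what ggplot2 necessarily draws.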

Second, geom_violin is based on a density estimate. Its source code points to StatYdensity, which in turn calls a function compute_density. I could not find the definition of compute_density, but I think (also based on some pointers in the help files) that it is essentially density, which by default uses a Gaussian kernel to estimate the density. This may (or may not) explain the differences, but

by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))

do indeed show differing values. So my guess is that the difference comes from whether the quantiles are computed from the empirical distribution function of the observations or from a kernel density estimate, though I admit that I have not shown this conclusively.
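
One caveat about the check above: quantile(density(v)$x) takes quantiles of the equally spaced evaluation grid, not of the estimated distribution itself. A minimal sketch of the second alternative is to accumulate the estimated density over the grid and invert the resulting approximate CDF. The helper kde_quantile below is hypothetical (it is not part of ggplot2 or base R), and it is not claimed to reproduce exactly what geom_violin draws, since the bandwidth and evaluation range used there may differ.

```r
# Hypothetical helper: approximate quantiles of a Gaussian kernel density
# estimate by accumulating the density over the evaluation grid.
kde_quantile <- function(v, probs = c(0.25, 0.5, 0.75)) {
  dens <- density(v)                    # Gaussian kernel, default bandwidth
  cdf  <- cumsum(dens$y) / sum(dens$y)  # approximate CDF on the equally spaced grid
  # invert the CDF by linear interpolation; the grid is already ordered
  approx(cdf, dens$x, xout = probs, ties = "ordered")$y
}

by_species <- split(iris$Sepal.Length, iris$Species)
kde_q <- sapply(by_species, kde_quantile)
emp_q <- sapply(by_species, quantile,
                probs = c(0.25, 0.5, 0.75), names = FALSE)
print(kde_q)  # quantiles of the smoothed (kernel) distribution
print(emp_q)  # quantiles of the raw observations
```

Comparing the two matrices row by row shows how far smoothing moves each quantile away from its empirical counterpart for each species.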