Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?

Posted 2019-07-25 16:17

I am attempting to produce a KDE-based PDF estimate for a series of distributions that may not be normally distributed.

I like the way R's stat_density in ggplot2 seems to pick up every incremental bump in frequency, but I cannot replicate this with Python's scipy.stats.gaussian_kde, which appears to oversmooth.

I have set up my R code as follows:

ggplot(test, aes(x = Val, color = as.factor(Class), group = as.factor(Class))) +
  stat_density(geom = "line", kernel = "gaussian",
               bw = "nrd0",  # nrd0 = Silverman's rule
               size = 1, position = "identity")

[plot: R example]

And my python code is:

from scipy import stats

kde = stats.gaussian_kde(data.ravel())
kde.set_bandwidth(bw_method='silverman')

[plot: Python example]

The R stats documentation states that nrd0 is the Silverman method for bandwidth selection.

Based on the code above, I am using the same kernel (Gaussian) and the same bandwidth method (Silverman) in both.
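One way to see what each library actually does is to print the bandwidth it selects. A minimal sketch on the Python side (data is the same array as above; the R counterpart is bw.nrd0(test$Val)):

from scipy import stats

kde = stats.gaussian_kde(data.ravel(), bw_method='silverman')
print(kde.factor)                     # scipy's dimensionless factor
print(kde.factor * data.std(ddof=1))  # the effective 1-D bandwidth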

Can anyone explain why the results are so different?

1 Answer

叼着烟拽天下 · 2019-07-25 16:58

There seems to be disagreement about what Silverman's Rule is.

The scipy docs say that Silverman's rule is implemented as:

def silverman_factor(self):
    return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))

Here d is the number of dimensions (1 in your case) and neff is the effective sample size (the number of points, assuming no weights). So the scipy bandwidth factor is (n * 3 / 4) ^ (-1 / 5), which scipy then multiplies by the standard deviation (computed in a separate method).
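This is easy to check against scipy itself; a minimal sketch (the sample here is made up, and gaussian_kde.silverman_factor() is the method quoted above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # any 1-D sample will do

kde = stats.gaussian_kde(x)
print(kde.silverman_factor())        # (3 * 200 / 4) ** (-1/5), about 0.3671
print((len(x) * 3 / 4) ** (-1 / 5))  # same value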

By contrast, R's stats package docs describe Silverman's method as "0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power". This can also be verified in R: typing bw.nrd0 at the console prints the source:

function (x) 
{
    if (length(x) < 2L) 
        stop("need at least 2 data points")
    hi <- sd(x)
    if (!(lo <- min(hi, IQR(x)/1.34))) 
        (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    0.9 * lo * length(x)^(-0.2)
}
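For comparison, here is a line-by-line Python port of that R function (a sketch: numpy's default percentile interpolation matches R's type-7 quantiles, and ddof=1 reproduces R's sd()):

import numpy as np

def bw_nrd0(x):
    # Sketch of a Python port of R's bw.nrd0
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        raise ValueError("need at least 2 data points")
    hi = x.std(ddof=1)                              # R's sd()
    iqr = np.subtract(*np.percentile(x, [75, 25]))  # R's IQR()
    lo = min(hi, iqr / 1.34)
    if lo == 0:                                     # same fallback chain as the R source
        lo = hi or abs(x[0]) or 1.0
    return 0.9 * lo * x.size ** (-0.2)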

Wikipedia, on the other hand, gives "Silverman's Rule of Thumb" as one of many possible names for the estimator:

1.06 * sigma * n ^ (-1 / 5)

The Wikipedia version is equivalent to the scipy version, up to rounding of the constant: (4 / 3) ^ (1 / 5) ≈ 1.0592 ≈ 1.06.
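The equivalence is easy to confirm numerically; a quick sketch for a hypothetical n = 200:

n = 200
print((n * 3 / 4) ** (-1 / 5))             # 0.3671..., scipy's factor
print((4 / 3) ** (1 / 5) * n ** (-1 / 5))  # same value, rewritten
print(1.06 * n ** (-1 / 5))                # 0.3674..., Wikipedia's rounded constant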

All three sources cite the same reference: Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 48. ISBN 978-0-412-24620-3. Wikipedia and R specifically cite page 48, whereas scipy's docs do not mention a page number. (I have submitted an edit to Wikipedia to update its page reference to p.45, see below.)


Addendum

I found a PDF of the Silverman reference.

On page 45, equation 3.28 is what is used in the Wikipedia article: (4 / 3) ^ (1 / 5) * sigma * n ^ (-1 / 5) ~= 1.06 * sigma * n ^ (-1 / 5). Scipy uses the same method, rewriting (3 / 4) ^ (-1 / 5) as the equivalent (4 / 3) ^ (1 / 5). Silverman describes this method as:

While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

The scipy docs reference this weakness, stating:

It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Silverman's book goes on to motivate the method used by R and Stata. On page 48, we get equation 3.31:

h = 0.9 * A * n ^ (-1 / 5)
# A defined on previous page, eqn 3.30
A = min(standard deviation, interquartile range / 1.34)

Silverman describes this method as:

The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

So it seems Wikipedia and scipy use the simpler estimator proposed by Silverman, while R and Stata use the more refined version.
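A practical consequence: to reproduce R's curve in scipy, compute R's bandwidth and pass it to gaussian_kde as a scalar. This is a sketch with made-up bimodal data; scipy treats a scalar bw_method as a factor that it multiplies by the sample standard deviation, so divide by the standard deviation first:

import numpy as np
from scipy import stats

# Made-up bimodal sample, just to illustrate the gap
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(3.0, 1.0, 500)])

n, sd = data.size, data.std(ddof=1)
iqr = np.subtract(*np.percentile(data, [75, 25]))

h_scipy = (n * 3 / 4) ** (-1 / 5) * sd         # scipy / Wikipedia rule
h_r = 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)  # R's bw.nrd0
print(h_scipy, h_r)                            # R's h is smaller here

# A scalar bw_method is used as the factor multiplied by the sample
# standard deviation, so h_r / sd makes scipy use R's bandwidth.
kde = stats.gaussian_kde(data, bw_method=h_r / sd)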
