Confusion about 2-dimensional kernel density estimation

Posted 2020-08-02 06:01

Question:

A kernel density estimator is used to estimate a probability density function (see the mvstat.net and scikit-learn docs for references).

My confusion is about what exactly kde2d() does. Does it estimate the joint probability density function f(a, b) of two random variables, as in the example below? And what do the colors mean?

Here is the code example I am referring to.

library(MASS)  # provides kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
density <- kde2d(a, b, n = 100)

colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))
filled.contour(density, color.palette=colour_flow)

Answer 1:

What is a kernel density estimator? Essentially, it places a small normal density curve over every data point (centred on that point) and then adds up all of these small densities, rescaled so that the total integrates to 1, to form the kernel density estimate.

For the sake of illustration I will add an image of a 1 dimensional kernel density estimator from one of your links.
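The "one bump per data point" idea can also be sketched in a few lines of R. This is only an illustration on made-up data (the vector `x` and the evaluation points in `grid` are assumptions, not part of the question); it uses a Gaussian kernel with the bandwidth rule `bw.nrd0()`, which is the default in `density()`.

```r
set.seed(1)
x <- rnorm(200)                       # toy data, not taken from the question
h <- bw.nrd0(x)                       # bandwidth = standard deviation of each bump
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)

# One normal density per data point, evaluated at each grid value, then averaged
manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

# The built-in estimator on the same range, for comparison
builtin <- density(x, bw = h, from = min(grid), to = max(grid), n = 512)

# Largest pointwise difference; density() uses a binned approximation,
# so the two curves agree closely but not exactly
max_diff <- max(abs(manual - builtin$y))

# Riemann sum of the manual estimate, which should be close to 1
area <- sum(manual) * (grid[2] - grid[1])
```

Plotting `grid` against `manual` with `plot(grid, manual, type = "l")` gives the familiar smooth curve.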

What about 2 dimensional kernel densities?

library(MASS)  # provides kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
# a and b contain 1000 values each.

density <- kde2d(a, b, n = 100)

The function creates a grid of n × n points (here 100 × 100) from min(a) to max(a) and from min(b) to max(b). Instead of fitting a tiny 1D normal density over every value in a or b, kde2d fits a tiny 2D normal density over every data point (a[i], b[i]). Just like in the 1-dimensional case, it then adds up these densities (and divides by the number of data points), and evaluates the result at every point of the grid.
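To make this concrete, here is a minimal sketch that recomputes one cell of the kde2d() output by hand. It assumes the defaults of kde2d() in current versions of MASS: a Gaussian product kernel whose standard deviations are `bandwidth.nrd(a) / 4` and `bandwidth.nrd(b) / 4`. The helper `kde_at()` and the seed are introduced here for illustration only.

```r
library(MASS)
set.seed(42)                          # seed added only for reproducibility
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))

density <- kde2d(a, b, n = 100)

# Assumed kde2d defaults: normal reference bandwidth, divided by 4
ha <- bandwidth.nrd(a) / 4
hb <- bandwidth.nrd(b) / 4

# Hypothetical helper: density estimate at one location (gx, gy), i.e. the
# average over all data points of a 2D normal density centred on (a[i], b[i])
kde_at <- function(gx, gy) mean(dnorm(gx, a, ha) * dnorm(gy, b, hb))

i <- 50; j <- 50                      # an arbitrary grid cell
manual <- kde_at(density$x[i], density$y[j])

# Should agree up to floating-point error if the assumptions above hold
all.equal(manual, density$z[i, j])
```

Here `density$x` and `density$y` hold the grid coordinates, and `density$z` is the 100 × 100 matrix of estimated density values that filled.contour() draws.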

What do the colours mean? As @cel pointed out in the comments: the estimated density depends on two variables, so we have three axes now (a, b and the estimated density). One way to visualize 3 axes is by using contours of equal density (iso-density contours). This sounds fancy, but it is basically the same as the high/low pressure maps we know from the weather forecast.

You are using

filled.contour(density,
    color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))

So from low to high, the plot will be coloured white, blue, yellow, red and eventually darkred for the highest values of the estimated density. This results in the following plot:
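To tie the colours to actual numbers, the sketch below inspects the estimated values directly. It assumes the defaults of filled.contour(), which cuts the range of z into about 20 levels with `pretty()` and gives each band one palette colour; the names `dens`, `cell`, `total`, `levels` and `cols` are introduced here for illustration.

```r
library(MASS)
set.seed(42)                          # seed added only for reproducibility
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
dens <- kde2d(a, b, n = 100)

# Area of one grid cell
cell <- diff(dens$x[1:2]) * diff(dens$y[1:2])

# Riemann sum over the grid: close to 1, as expected for a joint density
# (not exactly 1, because the grid stops at the range of the data)
total <- sum(dens$z) * cell

# Break points between colour bands, mirroring the filled.contour() defaults
levels <- pretty(range(dens$z), 20)

# One colour per band, from white (lowest density) to darkred (highest)
colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))
cols <- colour_flow(length(levels) - 1)
```

The values in `dens$z` are densities, not probabilities, so they can exceed 1; a probability is obtained only by integrating over a region, for example by summing `dens$z * cell` over the cells of interest.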