Set confidence levels in seaborn kdeplot

Posted 2019-04-11 09:42

Question:

I'm completely new to seaborn, so apologies if this is a simple question, but I cannot find anywhere in the documentation a description of how the levels plotted by n_levels are controlled in kdeplot. This is an example:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

x,y=np.random.randn(2,10000)

fig,ax=plt.subplots()
sns.kdeplot(x,y, shade=True,shade_lowest=False, ax=ax,n_levels=3,cmap="Reds")
plt.show()

This is the resulting plot:

I would like to be able to know what confidence levels are shown, so that I can label my plot "shaded regions show the (a,b,c) percentage confidence intervals." I would naively assume that n_levels is somehow related to equivalent "sigmas" in a Gaussian, but from the example that doesn't look to be the case.

Ideally, I would like to be able to specify the intervals shown by passing a tuple to kdeplot, such as:

levels=[68,95,99]

and plot these confidence regions.

EDIT: Thanks to @Goyo and @tom I think I can clarify my question, and come partway to the answer I am looking for. As pointed out, n_levels is passed to plt.contourf as levels, so a list can be passed. But sns.kdeplot plots the PDF, and the values of the PDF don't correspond to the confidence intervals I am looking for (since these correspond to an integration of the PDF).

What I need to do is pass sns.kdeplot the x,y values of the integrated (and normalized) PDF, and then I will be able to enter e.g. n_levels=[0.68,0.95,0.99,1].

EDIT 2: I have now solved this problem. See below. I use a 2d normed histogram to define the confidence intervals, which I then pass as levels to the normed kde plot. Apologies for repetition, I could make a function to return levels, but I typed it all out explicitly.

import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
import seaborn as sns

# Generate some random data
x,y=np.random.randn(2,100000)

# Make a 2d normed histogram
H,xedges,yedges=np.histogram2d(x,y,bins=40,density=True)

norm=H.sum() # Find the norm of the sum
# Set contour levels
contour1=0.99
contour2=0.95
contour3=0.68

# Set target levels as percentage of norm
target1 = norm*contour1
target2 = norm*contour2
target3 = norm*contour3

# Take histogram bin membership as proportional to Likelihood
# This is true when data comes from a Markovian process
def objective(limit, target):
    w = np.where(H>limit)
    count = H[w]
    return count.sum() - target

# Find levels by summing histogram to objective
level1= scipy.optimize.bisect(objective, H.min(), H.max(), args=(target1,))
level2= scipy.optimize.bisect(objective, H.min(), H.max(), args=(target2,))
level3= scipy.optimize.bisect(objective, H.min(), H.max(), args=(target3,))

# For nice contour shading with seaborn, define top level
level4=H.max()
levels=[level1,level2,level3,level4]

# Pass levels to normed kde plot
fig,ax=plt.subplots()
sns.kdeplot(x,y, shade=True,ax=ax,n_levels=levels,cmap="Reds_d",normed=True)
ax.set_aspect('equal')
plt.show()

The resulting plot is now the following:

The levels are slightly wider than I expect, but I think this is correct.
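
A possible reason the regions look wider than expected, assuming the data really are an uncorrelated 2D standard normal as in the example above: in two dimensions the probability mass inside radius r is 1 - exp(-r^2/2), so the 68%, 95% and 99% regions end at roughly 1.51, 2.45 and 3.03 sigma, not at the 1, 2 and 3 sigma familiar from the 1D case. The helper name radius_for_mass below is made up for this sketch; it only checks that formula against random samples.

```python
import numpy as np

def radius_for_mass(p):
    """Radius (in sigmas) of the circle holding mass p of a 2D standard normal."""
    # Invert P(R <= r) = 1 - exp(-r**2 / 2) for r.
    return np.sqrt(-2.0 * np.log1p(-np.asarray(p, dtype=float)))

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 200_000))
r = np.hypot(x, y)  # distance of each sample from the origin

for p in (0.68, 0.95, 0.99):
    analytic = radius_for_mass(p)
    empirical = np.quantile(r, p)  # radius containing a fraction p of the samples
    print(f"{p:.2f}: analytic {analytic:.3f}, empirical {empirical:.3f}")
```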

Answer 1:

The levels are not confidence intervals or sigmas but values of the estimated pdf. You can pass the levels as a list to n_levels instead of an integer.

EDIT

Seaborn just plots things. It won't give you the estimated pdf, just a matplotlib axes. So if you want to do calculations with the kde pdf you'll have to estimate it yourself.

Seaborn uses statsmodels or scipy under the hood so you can do the same. Statsmodels can give you also the cdf if that is what you are looking for (and maybe scipy but I am not sure). You can compute the levels you are interested in, evaluate the pdf in a grid and pass everything to contourf, which is more or less what seaborn does.
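
The approach described above could be sketched roughly as follows, using scipy.stats.gaussian_kde with its default bandwidth. This is only an illustration under assumptions, not what seaborn does internally: the helper name mass_levels, the grid limits of -4 to 4 and the 80x80 grid size are all choices made for this example. The thresholds are found by sorting the grid densities and accumulating mass from the densest cell downwards, which assumes all grid cells have equal area.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mass_levels(density, probs):
    """Density thresholds whose super-level sets hold the given mass.

    `density` is a pdf sampled on an equal-area grid (assumption);
    `probs` are fractions such as 0.68. Thresholds are returned in
    increasing order, which is what contourf expects.
    """
    flat = np.sort(density.ravel())[::-1]   # highest density first
    cum = np.cumsum(flat) / flat.sum()      # fraction of mass enclosed so far
    idx = np.searchsorted(cum, probs)       # first cell reaching each fraction
    return np.sort(flat[idx])

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 2000))

# Estimate the pdf ourselves and evaluate it on a regular grid.
kde = gaussian_kde(np.vstack([x, y]))
gx, gy = np.meshgrid(np.linspace(-4, 4, 80), np.linspace(-4, 4, 80))
z = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

levels = mass_levels(z, [0.68, 0.95, 0.99])
print(levels)
```

The resulting thresholds, with z.max() appended as a top edge, could then be handed to matplotlib, for example ax.contourf(gx, gy, z, levels=[*levels, z.max()]).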

Unfortunately I am not skilled enough to give you more advice on this (I just use statsmodels for OLS regressions every now and then), but you can look at the code of kdeplot and figure it out.



Answer 2:

I was just facing the same problem. What I don't understand is why the confidence levels, and therefore the plot, change when the number of bins is changed. You chose bins=40 in the histogram, but if you change it you get a different plot.
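
The dependence on the bin count can be reproduced with a small sketch. The helper histogram_levels is hypothetical and swaps the bisection from the question for a sort-and-accumulate step, which should give equivalent thresholds for equal-area bins. A likely explanation, offered as an assumption: the thresholds are read off noisy histogram heights, so both the noise per bin and the coarseness of the thresholds change with the bin size. For the standard 2D normal used here the exact threshold for mass p is (1 - p) / (2*pi), which gives a reference to compare against.

```python
import numpy as np

def histogram_levels(x, y, bins, probs=(0.99, 0.95, 0.68)):
    """Density thresholds enclosing each fraction in `probs`, from a 2D histogram."""
    H, _, _ = np.histogram2d(x, y, bins=bins, density=True)
    flat = np.sort(H.ravel())[::-1]      # densest bins first
    cum = np.cumsum(flat) / flat.sum()   # fraction of mass enclosed so far
    return flat[np.searchsorted(cum, probs)]

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 100_000))

# Exact thresholds for an uncorrelated 2D standard normal, for reference.
exact = (1.0 - np.array([0.99, 0.95, 0.68])) / (2.0 * np.pi)
print("exact", exact)

for bins in (20, 40, 80):
    print(bins, histogram_levels(x, y, bins))
```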