I am using the scipy.stats.gaussian_kde method from scipy to generate random samples from my data.
It works fine! I have now found out that it also has built-in methods to calculate the probability density function at a given set of points (my data).
I would like to know how it calculates the pdf given a set of points.
Here is a small example:
import numpy as np
from scipy import stats

def getDistribution1(data):
    kernel = stats.gaussian_kde(data, bw_method=0.06)

    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # random variates drawn from the estimated density
            return kernel.resample(int(self._size))

        def _cdf(self, x):
            # integrate the estimated pdf from -inf up to each x
            return np.array([kernel.integrate_box_1d(-np.inf, xi)
                             for xi in np.atleast_1d(x)])

        def _pdf(self, x):
            # evaluate the estimated pdf on a provided set of points
            return kernel.evaluate(x)

    return rv(name='kdedist')

test_data = np.random.random(100)  # random test data
distribution_data = getDistribution1(test_data)
pdf_data = distribution_data.pdf(test_data)  # the pdf of the data

The code above defines three methods:
rvs, to generate random samples based on the data
cdf, which is the integral of the pdf from -inf up to x
pdf, which is the pdf of the data
I need this pdf because I am trying to calculate weights for my data based on probability: I want to give each data point a probability that I can then use as its weight.
I would also like to know how I should proceed from here to calculate my weights.
P.S. Forgive me for asking the same question on Cross Validated; there has been no response there.
The online docs have a link to the source code, which for gaussian_kde is here: https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/kde.py#L193
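As a rough illustration of what that source does in the one-dimensional case, here is a minimal sketch that rebuilds the density by hand and compares it with kernel.evaluate. The helper name manual_gaussian_kde_pdf is hypothetical (it is not part of scipy), and the sketch assumes a scalar bw_method, so the kernel width is that scalar times the sample standard deviation (ddof=1). The final normalization into weights is only one possible reading of what "weights" should mean, not an established recipe.

```python
import numpy as np
from scipy import stats


def manual_gaussian_kde_pdf(data, points, bw_factor):
    """Evaluate a 1-D Gaussian KDE by hand (hypothetical helper, not scipy API)."""
    data = np.asarray(data, dtype=float)
    points = np.asarray(points, dtype=float)
    # Kernel width: scalar bandwidth factor times the sample standard deviation.
    sigma = bw_factor * data.std(ddof=1)
    # Standardized distance from every evaluation point to every data point,
    # shape (len(points), len(data)).
    z = (points[:, None] - data[None, :]) / sigma
    # One Gaussian bump per data point, then the average over all bumps.
    kernels = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


rng = np.random.default_rng(0)
test_data = rng.random(100)  # random test data, as in the question

kernel = stats.gaussian_kde(test_data, bw_method=0.06)
pdf_scipy = kernel.evaluate(test_data)
pdf_manual = manual_gaussian_kde_pdf(test_data, test_data, bw_factor=0.06)

# True if the hand-rolled density matches scipy's result.
agree = np.allclose(pdf_scipy, pdf_manual)

# Assumption: treat the normalized density values as per-point weights.
weights = pdf_scipy / pdf_scipy.sum()
```

In other words, the pdf at a point is the average of Gaussian bumps centered on each data point. Note that the values returned are densities, not probabilities, so they can exceed 1; dividing by their sum is one simple way to turn them into weights that add up to 1.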