Fitting a distribution given the histogram using s

2019-07-13 07:14发布

I would like to fit a distribution using scipy (in my case, using weibull_min) to my data. Is it possible to do this given the Histogram, and not the data points? In my case, because the histogram has integer bins of size 1, I know that I can extrapolate my data in the following way:

import numpy as np
orig_hist = np.array([10, 5, 3, 2, 1])

ext_data = reduce(lambda x,y: x+y, [[i]*x for i, x in enumerate(orig_hist)])

In this case, ext_data would hold this:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

And building the histogram using:

np.histogram(ext_data, bins=5)

would be equivalent to orig_hist

Yet, given that I already have the histogram built, I would like to avoid extrapolating the data and use orig_hist to fit the distribution, but I don't know if it is possible to use it directly in the fitting procedure. Additionally, is there a numpy function that can be used to perform something similar to the extrapolation I showed?

1条回答
We Are One
2楼-- · 2019-07-13 07:47

I might be misunderstanding something, but I believe that fitting to the histogram is exactly what you should do: you're trying to approximate the probability density. And the histogram is as close as you can get to the underlying probability density. You just have to normalize it in order to have an integral of 1, or allow your fitted model to contain an arbitrary prefactor.

import numpy as np
import scipy.stats as stats
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([10, 5, 3, 2, 1])
norm_hist = orig_hist/float(sum(orig_hist))

popt,pcov = opt.curve_fit(lambda x,c: stats.weibull_min.pdf(x,c), np.arange(len(norm_hist)),norm_hist)

plt.figure()
plt.plot(norm_hist,'o-',label='norm_hist')
plt.plot(stats.weibull_min.pdf(np.arange(len(norm_hist)),popt),'s-',label='Weibull_min fit')
plt.legend()

Of course for your given input the Weibull fit will be far from satisfactory:

fit to data

Update

As I mentioned above, Weibull_min is a poor fit to your sample input. The bigger problem is that it is also a poor fit to your actual data:

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

new histogram data

There are two main problems with this histogram. The first, as I said, is that it is unlikely to correspond to a Weibull_min distribution: it is maximal near zero and has a long tail, so it needs a non-trivial combination of Weibull parameters. Furthermore, your histogram clearly only contains a part of the distribution. This implies that my normalizing suggestion above is guaranteed to fail. You can't avoid using an arbitrary scale parameter in your fit.

I manually defined a scaled Weibull fitting function according to the formula on Wikipedia:

my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

In this function x is the independent variable, l is lambda (the scale parameter), c is k (the shape parameter) and A is a scaling prefactor. The faint upside of introducing A is that you don't have to normalize your histogram.

Now, when I dropped this function into scipy.optimize.curve_fit, I found what you did: it doesn't actually perform a fit, but sticks with the initial fitting parameters, whatever you set (using the p0 parameter; the default guesses are all 1 for every parametr). And curve_fit seems to think that the fitting converged.

After more than an hour's wall-related head-banging, I realized that the problem is that the singular behaviour at x=0 throws off the nonlinear least-squares algorithm. By excluding your very first data point, you get an actual fit to your data. I suspect that if we set c=1 and don't allow that to fit, then this problem might go away, but it is probably more informative to allow that to be fitted (so I didn't check).

Here's the corresponding code:

import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

popt,pcov = opt.curve_fit(my_weibull,np.arange(len(orig_hist))[1:],orig_hist[1:]) #throw away x=0!

plt.figure()
plt.plot(np.arange(len(orig_hist)),orig_hist,'o-',label='orig_hist')
plt.plot(np.arange(len(orig_hist)),my_weibull(np.arange(len(orig_hist)),*popt),'s-',label='Scaled Weibull fit')
plt.legend()

Result:

new fit

In [631]: popt
Out[631]: array([  1.10511850e+02,   8.82327822e-01,   1.05206207e+03])

the final fitted parameters are in the order (l,c,A), with the shape parameter of around 0.88. This corresponds to a diverging probability density, which explains why a few errors pop up saying

RuntimeWarning: invalid value encountered in power

and why there isn't a data point from the fitting for x=0. But judging from the visual agreement between data and fit, you can assess whether the result is acceptable or not.

If you want to overdo it, you can probably try generating points using np.random.weibull with these parameters, then comparing the resulting histograms with your own.

查看更多
登录 后发表回答