I would like to fit a distribution using scipy (in my case, using weibull_min) to my data. Is it possible to do this given only the histogram, and not the data points? In my case, because the histogram has integer bins of width 1, I know that I can extrapolate my data in the following way:
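from functools import reduce  # reduce is not a builtin in Python 3
import numpy as np
orig_hist = np.array([10, 5, 3, 2, 1])
# Repeat each bin index i as many times as its count x, then concatenate the lists
ext_data = reduce(lambda x, y: x + y, [[i] * x for i, x in enumerate(orig_hist)])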
In this case, ext_data would hold this:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
And building the histogram using:
np.histogram(ext_data, bins=5)
would return bin counts equal to orig_hist (np.histogram also returns the bin edges).
Yet, given that I already have the histogram built, I would like to avoid extrapolating the data and use orig_hist to fit the distribution, but I don't know if it is possible to use it directly in the fitting procedure. Additionally, is there a numpy function that can be used to perform something similar to the extrapolation I showed?
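On the last point, np.repeat can perform this kind of expansion directly. A minimal sketch, assuming (as in the example above) that bin i stands for the value i:

```python
import numpy as np

orig_hist = np.array([10, 5, 3, 2, 1])

# Repeat each bin index as many times as its count
ext_data = np.repeat(np.arange(len(orig_hist)), orig_hist)

# Rebuild the histogram with explicit integer bin edges 0, 1, ..., 5
counts, edges = np.histogram(ext_data, bins=np.arange(len(orig_hist) + 1))
```

Here counts matches orig_hist again, so the round trip loses nothing for integer bins of width 1.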
I might be misunderstanding something, but I believe that fitting to the histogram is exactly what you should do: you're trying to approximate the probability density. And the histogram is as close as you can get to the underlying probability density. You just have to normalize it in order to have an integral of 1, or allow your fitted model to contain an arbitrary prefactor.
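A minimal sketch of the normalization step, assuming bins of width 1 as in the question (the name bin_width is only for illustration):

```python
import numpy as np

orig_hist = np.array([10, 5, 3, 2, 1])
bin_width = 1.0  # assumption: integer bins of width 1, as in the question

# Divide by (total count * bin width) so the histogram integrates to 1
density = orig_hist / (orig_hist.sum() * bin_width)

# Check: the area under the normalized histogram
area = density.sum() * bin_width
```

This only makes sense if the histogram covers essentially the whole distribution; otherwise the prefactor has to be left free in the fit.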
Of course for your given input the Weibull fit will be far from satisfactory:
Update
As I mentioned above, Weibull_min is a poor fit to your sample input. The bigger problem is that it is also a poor fit to your actual data:
There are two main problems with this histogram. The first, as I said, is that it is unlikely to correspond to a Weibull_min distribution: it is maximal near zero and has a long tail, so it needs a non-trivial combination of Weibull parameters. Furthermore, your histogram clearly only contains a part of the distribution. This implies that my normalizing suggestion above is guaranteed to fail. You can't avoid using an arbitrary scale parameter in your fit.
I manually defined a scaled Weibull fitting function according to the formula on Wikipedia:
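A minimal sketch of such a function, assuming the standard two-parameter Weibull density from Wikipedia multiplied by a prefactor; the function name weib is hypothetical, and the argument names follow the description that comes next:

```python
import numpy as np


def weib(x, l, c, A):
    """Scaled Weibull density: A * (c/l) * (x/l)**(c-1) * exp(-(x/l)**c).

    l is the scale (lambda), c is the shape (k), A is a free prefactor.
    """
    x = np.asarray(x, dtype=float)
    return A * (c / l) * (x / l) ** (c - 1) * np.exp(-((x / l) ** c))
```

For c = 1 this reduces to a scaled exponential, A / l * exp(-x / l), which is a quick sanity check on the formula.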
In this function, x is the independent variable, l is lambda (the scale parameter), c is k (the shape parameter) and A is a scaling prefactor. The faint upside of introducing A is that you don't have to normalize your histogram.
Now, when I dropped this function into scipy.optimize.curve_fit, I found what you did: it doesn't actually perform a fit, but sticks with the initial fitting parameters, whatever you set (using the p0 parameter; the default guess is 1 for every parameter). And curve_fit seems to think that the fit converged.
After more than an hour's wall-related head-banging, I realized that the problem is that the singular behaviour at x=0 throws off the nonlinear least-squares algorithm. By excluding your very first data point, you get an actual fit to your data. I suspect that if we set c=1 and don't allow it to be fitted, this problem might go away, but it is probably more informative to allow it to be fitted (so I didn't check).
Here's the corresponding code:
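A minimal sketch of this procedure, not the original script: the data are a hypothetical, noiseless stand-in generated from known parameters, the name weib is hypothetical, and the positivity bounds passed to curve_fit are an addition of this sketch to keep l and c in the valid range.

```python
import numpy as np
from scipy.optimize import curve_fit


def weib(x, l, c, A):
    """Scaled Weibull density with scale l, shape c and prefactor A."""
    return A * (c / l) * (x / l) ** (c - 1) * np.exp(-((x / l) ** c))


# Hypothetical stand-in for the real histogram: noiseless values generated
# from known parameters on integer bins 0, 1, ..., 29.
true_params = (5.0, 0.9, 100.0)
x = np.arange(30, dtype=float)
y = np.empty_like(x)
y[1:] = weib(x[1:], *true_params)
y[0] = y[1]  # placeholder only: the model diverges at x=0 when c < 1

# Exclude the first data point (x=0) from the fit.
# The bounds are an assumption of this sketch, not part of the description.
popt, pcov = curve_fit(
    weib, x[1:], y[1:],
    p0=(4.0, 1.0, 80.0),
    bounds=([1e-6, 1e-6, 0.0], [np.inf, np.inf, np.inf]),
)
l_fit, c_fit, A_fit = popt
```

With real histogram counts in place of y, the slicing x[1:], y[1:] is the only part that matters for the x=0 issue; the starting guess p0 will need adjusting to the data.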
Result:
The final fitted parameters are in the order (l, c, A), with a shape parameter of around 0.88. This corresponds to a diverging probability density (for c < 1 the density diverges as x approaches 0), which explains why a few errors pop up during the fit and why there isn't a data point from the fitting for x=0. But judging from the visual agreement between data and fit, you can assess whether the result is acceptable or not.
If you want to overdo it, you can probably try generating points using np.random.weibull with these parameters, then comparing the resulting histograms with your own.
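A minimal sketch of that comparison. The shape 0.88 is the value quoted above; the scale l_fit, the sample size and the bin range are hypothetical placeholders. NumPy's Weibull sampler draws from the standard distribution with scale 1, so the samples are multiplied by the scale; the seeded Generator method rng.weibull is used here instead of np.random.weibull only for reproducibility.

```python
import numpy as np

c_fit = 0.88   # shape parameter quoted in the answer
l_fit = 5.0    # hypothetical scale; substitute the fitted value
n_samples = 10_000  # hypothetical sample size

rng = np.random.default_rng(0)
# Standard Weibull draws have scale 1, so rescale by l_fit
samples = l_fit * rng.weibull(c_fit, size=n_samples)

# Histogram on integer bins of width 1, matching the layout in the question
edges = np.arange(0, 31)
counts, _ = np.histogram(samples, bins=edges)
```

The resulting counts can then be rescaled to the total of the measured histogram and plotted against it.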