Python - Statistical distribution

Posted 2019-09-21 17:42

Question:

I'm quite new to the Python world, and I'm not a statistician. I need to implement mathematical models developed by mathematicians in a programming language, and I chose Python after some research. I'm comfortable with programming as such (PHP/HTML/JavaScript).

I have a column of values that I've extracted from a MySQL database, and I need to calculate the following:

1) The normal distribution of the values. (I don't have the sigma and mu values; apparently these need to be calculated too.)
2) A mixture of normal distributions
3) The estimated density of the normal distribution
4) The 'Z' score

The array of values looks similar to the one below (I've populated it with sample data):

d1 = [3,3,3,3,3,3,3,9,12,6,3,3,3,3,9,21,3,12,3,6,3,30,12,6,3,3,24,30,3,3,3]


mu1, std1 = norm.fit(d1)

As I understand it, the normal distribution can be fitted as below:

import numpy as np
from scipy.stats import norm

mu, std = norm.fit(data)

Could I please get some pointers on how to get started with (2), (3) and (4)? I'm continuing to search online while I look forward to hearing from experts.

If the question doesn't fully make sense, please let me know which aspect is missing and I'll try to get more information about it.

I'd very much appreciate any help here please.

Answer 1:

Some parts of your question are unclear. It might help to give the context of what you're trying to achieve, rather than the specific steps you're taking.

1) + 3) For a normal distribution, fitting the distribution and estimating the mean and standard deviation are basically the same thing, because the mean and standard deviation completely determine the distribution.

mu, std = norm.fit(data)

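is tantamount to saying "find the mean and standard deviation which best fit the data". For norm.fit these are the maximum-likelihood estimates: the sample mean and the standard deviation computed with divisor n.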
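As a quick check of this point, the sketch below fits the sample data from the question and compares the result with the plain sample mean and standard deviation. It assumes NumPy and SciPy are installed. Evaluating the fitted density on a grid is only one possible reading of "estimate density" in (3), and the grid size of 50 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

# Sample data from the question
d1 = [3, 3, 3, 3, 3, 3, 3, 9, 12, 6, 3, 3, 3, 3, 9, 21, 3, 12, 3, 6, 3,
      30, 12, 6, 3, 3, 24, 30, 3, 3, 3]
data = np.asarray(d1, dtype=float)

# norm.fit returns the maximum-likelihood estimates of the mean and std
mu, std = norm.fit(data)

# They coincide with the sample mean and the ddof=0 standard deviation
same_mean = np.isclose(mu, data.mean())
same_std = np.isclose(std, data.std(ddof=0))

# Density of the fitted normal, evaluated on a grid over the data range
xs = np.linspace(data.min(), data.max(), 50)
density = norm.pdf(xs, loc=mu, scale=std)
```

Note that for data this skewed (most values equal 3, with a long right tail) a single normal may describe the sample poorly, even though the fit itself always succeeds.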

4) Calculating the Z score: you'll have to explain what you're trying to do. A Z score usually measures how far above (or below) the mean a data point is, in units of standard deviation. Is this what you need here? If so, it is simply

(np.array(data) - mu) / std
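A minimal sketch of that expression on the question's sample data, assuming NumPy and SciPy are available. The comparison with scipy.stats.zscore is added here only as a cross-check; with its default ddof=0 it uses the same divisor as norm.fit.

```python
import numpy as np
from scipy.stats import norm, zscore

# Sample data from the question
d1 = [3, 3, 3, 3, 3, 3, 3, 9, 12, 6, 3, 3, 3, 3, 9, 21, 3, 12, 3, 6, 3,
      30, 12, 6, 3, 3, 24, 30, 3, 3, 3]
data = np.asarray(d1, dtype=float)

mu, std = norm.fit(data)

# Z score of every data point: distance from the mean in units of std
z = (data - mu) / std

# scipy.stats.zscore (default ddof=0) should give the same values
matches_scipy = np.allclose(z, zscore(data))
```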

2) Mixture of normal distributions: it is unclear what you need here. A mixture usually means that the data are generated by more than one normal distribution. What do you mean by this?



Answer 2:

About (2), a web search for "mixture of Gaussians Python" should turn up a lot of hits.

The mixture of Gaussians is a pretty simple idea: instead of a single Gaussian bump, the density contains multiple bumps. The density is a weighted sum $\sum_k \alpha_k g(x, \mu_k, \sigma_k^2)$ where the weights $\alpha_k$ are positive and sum to 1, and $g(x, \mu, \sigma^2)$ is a single Gaussian bump with mean $\mu$ and variance $\sigma^2$.
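To make the weighted sum concrete, here is a small sketch that evaluates such a density with NumPy and SciPy. The function name mixture_pdf and the two-component parameters are hypothetical and chosen only for illustration. Note that norm.pdf takes the standard deviation $\sigma_k$ as its scale, not the variance $\sigma_k^2$.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture: sum_k alpha_k * g(x; mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)[..., None]   # add a trailing axis for components
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # norm.pdf broadcasts x against the k components; sum over the component axis
    return np.sum(weights * norm.pdf(x, loc=means, scale=stds), axis=-1)

# Hypothetical parameters: a narrow bump near 3 and a wide bump near 20
weights = [0.7, 0.3]
means = [3.0, 20.0]
stds = [1.0, 8.0]

xs = np.linspace(-60.0, 100.0, 16001)
density = mixture_pdf(xs, weights, means, stds)

# Because the weights sum to 1, the mixture density integrates to about 1
total = density.sum() * (xs[1] - xs[0])
```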

To determine the parameters $\alpha_k$, $\mu_k$, and $\sigma_k^2$, typically one uses the so-called expectation-maximization (EM) algorithm. Again a web search should find many hits. The EM algorithm for a Gaussian mixture is implemented in some Python libraries. It is not too complicated to write it yourself, but maybe to get started you can use an existing implementation.
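For a sense of what writing it yourself involves, below is a rough sketch of EM for a one-dimensional Gaussian mixture. It is a sketch under stated assumptions, not a robust implementation: the function name fit_gmm_1d, the quantile-based initialisation, the fixed iteration count and the variance floor are all choices made here for illustration, and the data are synthetic. One existing implementation is GaussianMixture in scikit-learn's sklearn.mixture module.

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_1d(data, k=2, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    x = np.asarray(data, dtype=float)
    # Initialisation (an assumption of this sketch): equal weights,
    # means spread over the data quantiles, one common standard deviation
    alpha = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, shape (n, k)
        dens = alpha * norm.pdf(x[:, None], loc=mu, scale=sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations
        nk = resp.sum(axis=0)
        alpha = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(np.maximum(var, 1e-6))  # floor keeps a component from collapsing
    return alpha, mu, sigma

# Synthetic data from two known normals, so the result can be sanity-checked
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(3.0, 1.0, 300), rng.normal(20.0, 4.0, 200)])

alpha, mu, sigma = fit_gmm_1d(sample, k=2)
```

On real data, such as the sample in the question where most values are identical, EM can converge to degenerate solutions (a component with near-zero variance), so a library implementation with regularisation and multiple restarts is usually the safer starting point.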