scipy linregress function erroneous standard error

2019-03-27 23:24发布

问题:

I have a weird situation with scipy.stats.linregress seems to be returning an incorrect standard error:

from scipy import stats
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
>>> gradient
5.3935773611970186
>>> intercept
-16.281127993087829
>>> r_value
0.72443514211849758
>>> r_value**2
0.52480627513624778
>>> std_err
3.6290901222878866

Whereas Excel returns the following:

 slope: 5.394

 intercept: -16.281

 rsq: 0.525

 steyX: 11.696

steyX is excel's standard error function, returning 11.696 versus scipy's 3.63. Anybody know what's going on here? Any alternative way of getting the standard error of a regression in python, without going to Rpy?

回答1:

You could try the statsmodels package:

In [37]: import statsmodels.api as sm

In [38]: x = [5.05, 6.75, 3.21, 2.66]

In [39]: y = [1.65, 26.5, -5.93, 7.96]

In [40]: X = sm.add_constant(x) # intercept

In [41]: model = sm.OLS(y, X)

In [42]: fit = model.fit()

In [43]: fit.params
Out[43]: array([  5.39357736, -16.28112799])

In [44]: fit.rsquared
Out[44]: 0.52480627513624789

In [45]: np.sqrt(fit.mse_resid)
Out[45]: 11.696414461570097


回答2:

I've just been informed by the SciPy user group that the std_err here represents the standard error of the gradient line, not the standard error of the predicted y's, as per Excel. Nevertheless users of this function should be careful, because this was not always the behaviour of this library - it used to output exactly as Excel, and the changeover appears to have occurred in the past few months.

Anyway still looking for an equivalent to STEYX in Python.



回答3:

yes this is true - the standard estimate of the gradient is what linregress returns; the standard estimate of the estimate (Y) is related, though, and you can back-into the SEE by multiplying the standard error of the gradient (SEG) that linregress gives you: SEG = SEE / sqrt( sum of (X - average X)**2 )

Stack Exchange doesn't handle latex but the math is here if you are interested, under the "Analyze Sample Data" heading.



回答4:

The calculation of "std err on y" in excel is actually standard deviation of values of y.

That's the same for std err on x. The number '2' in the final step is the degree of freedom of example you given.

>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> def power(a):
        return a*5.3936-16.2811

>>> y_fit = list(map(power,x))
>>> y_fit
[10.956580000000002, 20.125700000000005, 1.032356, -1.934123999999997]
>>> var = [y[i]-y_fit[i] for i in range(len(y))]
>>> def pow2(a):
        return a**2

>>> summa = list(map(pow2,var))
>>> summa
[86.61243129640003, 40.63170048999993, 48.47440107073599, 97.89368972737596]
>>> total = 0
>>> for i in summa:
        total += i
>>> total
273.6122225845119
>>> import math
>>> math.sqrt(total/2)
11.696414463084658