I have a weird situation with scipy.stats.linregress seems to be returning an incorrect standard error:
from scipy import stats
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
>>> gradient
5.3935773611970186
>>> intercept
-16.281127993087829
>>> r_value
0.72443514211849758
>>> r_value**2
0.52480627513624778
>>> std_err
3.6290901222878866
Whereas Excel returns the following:
slope: 5.394
intercept: -16.281
rsq: 0.525
steyX: 11.696
steyX is excel's standard error function, returning 11.696 versus scipy's 3.63. Anybody know what's going on here? Any alternative way of getting the standard error of a regression in python, without going to Rpy?
You could try the statsmodels package:
In [37]: import statsmodels.api as sm
In [38]: x = [5.05, 6.75, 3.21, 2.66]
In [39]: y = [1.65, 26.5, -5.93, 7.96]
In [40]: X = sm.add_constant(x) # intercept
In [41]: model = sm.OLS(y, X)
In [42]: fit = model.fit()
In [43]: fit.params
Out[43]: array([ 5.39357736, -16.28112799])
In [44]: fit.rsquared
Out[44]: 0.52480627513624789
In [45]: np.sqrt(fit.mse_resid)
Out[45]: 11.696414461570097
I've just been informed by the SciPy user group that the std_err here represents the standard error of the gradient line, not the standard error of the predicted y's, as per Excel. Nevertheless users of this function should be careful, because this was not always the behaviour of this library - it used to output exactly as Excel, and the changeover appears to have occurred in the past few months.
Anyway still looking for an equivalent to STEYX in Python.
yes this is true - the standard estimate of the gradient is what linregress returns; the standard estimate of the estimate (Y) is related, though, and you can back-into the SEE by multiplying the standard error of the gradient (SEG) that linregress gives you: SEG = SEE / sqrt( sum of (X - average X)**2 )
Stack Exchange doesn't handle latex but the math is here if you are interested, under the "Analyze Sample Data" heading.
The calculation of "std err on y" in excel is actually standard deviation of values of y.
That's the same for std err on x. The number '2' in the final step is the degree of freedom of example you given.
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> def power(a):
return a*5.3936-16.2811
>>> y_fit = list(map(power,x))
>>> y_fit
[10.956580000000002, 20.125700000000005, 1.032356, -1.934123999999997]
>>> var = [y[i]-y_fit[i] for i in range(len(y))]
>>> def pow2(a):
return a**2
>>> summa = list(map(pow2,var))
>>> summa
[86.61243129640003, 40.63170048999993, 48.47440107073599, 97.89368972737596]
>>> total = 0
>>> for i in summa:
total += i
>>> total
273.6122225845119
>>> import math
>>> math.sqrt(total/2)
11.696414463084658