Question:
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})
Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
Answer 1:
I think you can do almost exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept 14.952480
B 0.401182
C 0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
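Since the original goal was predicting A, note that the fitted result can also score new rows directly through the formula interface; a minimal sketch (the new B and C values below are invented for illustration):
>>> new_data = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})
>>> result.predict(new_data)  # returns a Series of predicted A values for the new rows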
Answer 2:
Note: pandas.stats has been removed as of pandas 0.20.0.
In older pandas versions, it's possible to do this with pandas.stats.ols:
>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations:         5
Number of Degrees of Freedom:   3
R-squared:         0.5789
Adj R-squared:     0.1577
Rmse:             14.5108
F-stat (2, 2):     1.3746, p-value:     0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------
Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
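Since pandas.stats is gone in modern pandas, the closest replacement is to call statsmodels directly without the formula interface; a minimal sketch, assuming a current statsmodels installation (the coefficients match the statsmodels fit in Answer 1, with the intercept labelled const):
>>> import statsmodels.api as sm
>>> X = sm.add_constant(df[['B', 'C']])  # add an explicit intercept column, as pandas.stats.ols did
>>> res = sm.OLS(df['A'], X).fit()
>>> res.params
const    14.952480
B         0.401182
C         0.000352
dtype: float64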
Answer 3:
I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])
>>> reg.coef_
array([ 4.01182386e-01, 3.51587361e-04])
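Prediction works the same way, taking data frame columns as input; a short usage sketch (the intercept value shown is from the same fit reported in the other answers):
>>> reg.predict(df[['B', 'C']])  # fitted A values for the training rows
>>> reg.intercept_
14.952479503953672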
Answer 4:
This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.
No, it doesn't; just convert to a NumPy array:
>>> import numpy as np
>>> data = np.asarray(df)
This takes constant time because it just creates a view on your data (the frame here holds a single dtype; with mixed dtypes a copy is made instead). Then feed it to scikit-learn:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([ 4.01182386e-01, 3.51587361e-04])
>>> lr.intercept_
14.952479503953672
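If you want the coefficients labelled by column name again, you can wrap them back into pandas; a small sketch, assuming the same X/y split as above (df.columns[1:] is ['B', 'C'] here, since column A was the target):
>>> pd.Series(lr.coef_, index=df.columns[1:])
B    0.401182
C    0.000352
dtype: float64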