What is the recommended way (if any) for doing linear regression using a pandas dataframe? I can do it, but my method seems very elaborate. Am I making things unnecessarily complicated?
The R code, for comparison:
x <- c(1,2,3,4,5)
y <- c(2,1,3,5,4)
M <- lm(y~x)
summary(M)$coefficients
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
Now, my python (2.7.10), rpy2 (2.6.0), and pandas (0.16.1) version:
import pandas
import pandas.rpy.common as common
from rpy2 import robjects
from rpy2.robjects.packages import importr
base = importr('base')
stats = importr('stats')
dataframe = pandas.DataFrame({'x': [1,2,3,4,5],
                              'y': [2,1,3,5,4]})
robjects.globalenv['dataframe']\
    = common.convert_to_r_dataframe(dataframe)
M = stats.lm('y~x', data=base.as_symbol('dataframe'))
print(base.summary(M).rx2('coefficients'))
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
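As a sanity check on the table above, the estimates, standard errors, and t values can be reproduced from the textbook ordinary least squares formulas without going through R. This is only a sketch, assuming a recent Python 3 with NumPy installed alongside pandas; the helper name `ols_table` is made up for illustration and is not part of pandas or rpy2. It leaves out the p-values, which need a t-distribution CDF (for example from SciPy).

```python
import numpy as np
import pandas as pd


def ols_table(df, x_col, y_col):
    """Hypothetical helper: coefficient table for y ~ x via OLS formulas."""
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    n = len(x)

    # Design matrix with an intercept column, mirroring R's y ~ x.
    X = np.column_stack([np.ones(n), x])

    # Least-squares estimates of (intercept, slope).
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

    # Residual variance with n - 2 degrees of freedom.
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)

    # Standard errors from the diagonal of sigma^2 * (X'X)^-1.
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

    return pd.DataFrame(
        {"Estimate": beta, "Std. Error": se, "t value": beta / se},
        index=["(Intercept)", x_col],
    )


dataframe = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 1, 3, 5, 4]})
table = ols_table(dataframe, "x", "y")
print(table)
```

The first three columns should agree with the R output to the printed precision, which suggests the rpy2 round trip is not altering the data.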
By the way, I do get a FutureWarning on the import of pandas.rpy.common. However, when I tried pandas2ri.py2ri(dataframe) to convert a dataframe from pandas to R (as mentioned here), I get
NotImplementedError: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
The R and Python versions are not strictly identical, because you build a data frame in Python/rpy2 whereas you use plain vectors (without a data frame) in R.
Otherwise, the conversion shipping with rpy2 appears to be working here:
The result:
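After calling pandas2ri.activate(), some conversions from pandas objects to R objects happen automatically. For example, you can pass the DataFrame directly as the data argument of lm instead of first assigning it to robjects.globalenv and referring to it with base.as_symbol. This yields the same coefficient table as in the question.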
I can add to unutbu's answer by outlining how to retrieve particular elements of the coefficients table including, crucially, the p-values.
This leaves us with a DataFrame which we can access in the normal way:
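A minimal sketch of that last step, assuming the coefficient table has already been converted into a pandas DataFrame that keeps R's row and column labels. To stay independent of rpy2 and R, the DataFrame here is typed in by hand from the values printed in the question; the name `coefs` is only illustrative.

```python
import pandas as pd

# Stand-in for the converted table: values copied from the output in the question.
coefs = pd.DataFrame(
    {
        "Estimate": [0.6, 0.8],
        "Std. Error": [1.1489125, 0.3464102],
        "t value": [0.522233, 2.309401],
        "Pr(>|t|)": [0.6376181, 0.1040880],
    },
    index=["(Intercept)", "x"],
)

# A single cell: the p-value of the slope.
p_slope = coefs.loc["x", "Pr(>|t|)"]

# A whole column: all p-values as a Series indexed by term.
p_values = coefs["Pr(>|t|)"]

# A whole row: everything R reports for the intercept.
intercept_row = coefs.loc["(Intercept)"]

print(p_slope)
print(p_values)
print(intercept_row)
```

Label-based indexing with `.loc` keeps the code readable because the row and column names match what R prints.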