Statsmodels formula API (patsy): How to exclude a

Posted 2019-07-09 06:52

I'm building a WLS (statsmodels.formula.api.wls) model using the statsmodels formulas API (from patsy) and I'm using interactions between factors. Some of these are predictive whereas others are not. Is there a way to include only a subset of the interactions in the model without resorting to building a design matrix by hand?

Alternatively, is there a way to constrain the estimated coefficients of a subset of the model variables to be equal to zero?

2 answers
在下西门庆
#2 · 2019-07-09 07:40

I don't understand what you mean by "a subset of the interactions". One thing you might mean is a formula like

y ~ pred1 + pred2 + pred3 + pred1:pred3 + pred1:pred2

or the equivalent

y ~ (pred1 + pred2 + pred3)**2 - pred2:pred3

where the latter makes it obvious that we're including some of the possible interactions, but not all of them (we've left out pred2:pred3).
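One way to check which terms a formula expands to is to ask patsy directly. The sketch below parses a few formulas with `ModelDesc.from_formula` and compares their right-hand-side terms; the variable names `pred1`, `pred2`, `pred3` are just the placeholders from the formulas above, and no data is needed. Note that `pred1*pred2*pred3` also generates the three-way term `pred1:pred2:pred3`, which is why the pairwise form `(pred1 + pred2 + pred3)**2` is used for the comparison.

```python
from patsy import ModelDesc


def rhs_terms(formula):
    """Return the set of right-hand-side Term objects patsy derives from a formula."""
    return set(ModelDesc.from_formula(formula).rhs_termlist)


explicit = rhs_terms("y ~ pred1 + pred2 + pred3 + pred1:pred3 + pred1:pred2")
compact = rhs_terms("y ~ (pred1 + pred2 + pred3)**2 - pred2:pred3")
full = rhs_terms("y ~ pred1*pred2*pred3")

# Terms compare equal regardless of the order they were written in.
print("explicit == compact:", explicit == compact)

# Terms that the full three-way expansion has and the explicit formula lacks.
extra = full - explicit
print("only in pred1*pred2*pred3:", sorted(term.name() for term in extra))
```

Comparing `Term` objects rather than strings keeps the check independent of the order in which the terms were written.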

But this is easy to do, so I'm guessing that what you actually want is to include only a subset of the coefficients associated with a single interaction. If so, then no, that isn't currently implemented. It's also fairly dubious from a statistical perspective: if you start leaving out arbitrary columns, you change the interpretation of all the other columns in ways that are very hard to reason about. I also can't think of a good, implementable syntax for describing the partial interaction you want... if you can, feel free to file a feature request on patsy.

Also, I don't believe that statsmodels includes a way to fit a restricted model like that, no. It would be a good feature request.

我欲成王,谁敢阻挡
#3 · 2019-07-09 07:42

I'm not sure I understand exactly what you need, but I suggest you start with the truly excellent patsy docs (patsy handles formulas for statsmodels). There's a nice section on categorical data: http://patsy.readthedocs.org/en/latest/index.html

My guess is that what you want is going to be hard to achieve with a single formula call. I would probably just use patsy to build a design matrix with more terms than I need and then drop columns. For example:

In [28]: import statsmodels.api as sm
In [29]: import pandas as pd
In [30]: import numpy as np
In [31]: import patsy
In [32]: url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
In [33]: df = pd.read_csv(url)
In [34]: w = np.ones(df.shape[0])
In [35]: f = 'Lottery ~ Wealth : C(Region)'
In [36]: y,X = patsy.dmatrices(f, df, return_type='dataframe')
In [37]: X.head()
Out[37]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns:
Intercept                5  non-null values
Wealth:C(Region)[nan]    5  non-null values
Wealth:C(Region)[C]      5  non-null values
Wealth:C(Region)[E]      5  non-null values
Wealth:C(Region)[N]      5  non-null values
Wealth:C(Region)[S]      5  non-null values
Wealth:C(Region)[W]      5  non-null values
dtypes: float64(7)

In [38]: X = X.iloc[:, [2, 3, 4]]
In [39]: X.head()
Out[39]: 
   Wealth:C(Region)[C]  Wealth:C(Region)[E]  Wealth:C(Region)[N]
0                    0                   73                    0
1                    0                    0                   22
2                   61                    0                    0
3                    0                   76                    0
4                    0                   83                    0

In [40]: mod = sm.WLS(y, X, 1./w).fit()
In [41]: mod.params
Out[41]: 
Wealth:C(Region)[C]    1.084430
Wealth:C(Region)[E]    0.650396
Wealth:C(Region)[N]    1.021582