I'm building a WLS (statsmodels.formula.api.wls) model using the statsmodels formulas API (from patsy) and I'm using interactions between factors. Some of these are predictive whereas others are not. Is there a way to include only a subset of the interactions in the model without resorting to building a design matrix by hand?
Alternatively, is there a way to constrain the estimated coefficients of a subset of the model variables to be equal to zero?
I'm not sure I understand exactly what you need, but I suggest you start with the truly excellent patsy docs (patsy handles formulas for statsmodels). There's a nice section on categorical data: http://patsy.readthedocs.org/en/latest/index.html
My guess is that what you want is going to be hard to achieve with a single formula call. I would probably just use patsy to build a design matrix with more terms than I need and then drop columns. For example:
In [28]: import statsmodels.formula.api as sm
In [29]: import pandas as pd
In [30]: import numpy as np
In [31]: import patsy
In [32]: url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
In [33]: df = pd.read_csv(url)
In [34]: w = np.ones(df.shape[0])
In [35]: f = 'Lottery ~ Wealth : C(Region)'
In [36]: y,X = patsy.dmatrices(f, df, return_type='dataframe')
In [37]: X.head()
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns:
Intercept 5 non-null values
Wealth:C(Region)[nan] 5 non-null values
Wealth:C(Region)[C] 5 non-null values
Wealth:C(Region)[E] 5 non-null values
Wealth:C(Region)[N] 5 non-null values
Wealth:C(Region)[S] 5 non-null values
Wealth:C(Region)[W] 5 non-null values
dtypes: float64(7)
In [38]: X = X.iloc[:, [2, 3, 4]]  # positional selection; .ix has been removed from pandas
In [39]: X.head()
Out[39]:
Wealth:C(Region)[C] Wealth:C(Region)[E] Wealth:C(Region)[N]
0 0 73 0
1 0 0 22
2 61 0 0
3 0 76 0
4 0 83 0
In [40]: mod = sm.WLS(y, X, 1./w).fit()
In [41]: mod.params
Out[41]:
Wealth:C(Region)[C] 1.084430
Wealth:C(Region)[E] 0.650396
Wealth:C(Region)[N] 1.021582
I don't understand what you mean by "a subset of the interactions". One thing you might mean is a formula like
y ~ pred1 + pred2 + pred3 + pred1:pred3 + pred1:pred2
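or the equivalent
y ~ (pred1 + pred2 + pred3)**2 - pred2:pred3
where the latter makes it obvious that we're including some of the possible two-way interactions, but not all of them (we've left out pred2:pred3). Writing pred1*pred2*pred3 instead would also bring in the three-way interaction pred1:pred2:pred3, which would then have to be subtracted as well.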
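To check which columns a given formula actually produces, one can build the design matrices with patsy and compare them. The sketch below uses made-up numeric data (the frame `data` and the helper `term_set` are hypothetical names) and compares the explicit formula with two shorthand spellings. Note that `*` expands to interactions of every order, so a product of three predictors also carries a three-way term.

```python
import numpy as np
import pandas as pd
import patsy

# Made-up numeric predictors, purely for inspecting the design columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 3)), columns=["pred1", "pred2", "pred3"])

def term_set(formula):
    """Return the design-matrix columns as a set of factor-name sets.

    Using frozensets makes the comparison independent of column order
    and of the order of factors within an interaction name.
    """
    X = patsy.dmatrix(formula, data, return_type="dataframe")
    return {frozenset(name.split(":")) for name in X.columns}

explicit = term_set("pred1 + pred2 + pred3 + pred1:pred3 + pred1:pred2")
squared = term_set("(pred1 + pred2 + pred3)**2 - pred2:pred3")
starred = term_set("pred1*pred2*pred3 - pred2:pred3")

# Same terms from the explicit and the **2 spelling.
print(explicit == squared)
# Terms that the * spelling adds on top of the explicit formula.
print(starred - explicit)
```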
But this is easy to do, so I'm guessing that what you actually meant is that you want to include only a subset of the coefficients associated with a single interaction. If so, then no, that isn't something that's currently implemented. It's fairly dubious from a statistical perspective as well: if you start leaving out arbitrary columns, you change the meaning of all the other columns in ways that are very hard to interpret. I also can't really think of a good, implementable syntax for describing the partial interaction you want; if you can, feel free to file a feature request on patsy.
Also, I don't believe that statsmodels includes a way to fit a restricted model like that, no. It would be a good feature request.