I have a data frame like this:
Index ID Industry years_spend asset
6646 892 4 4 144.977037
2347 315 10 8 137.749138
7342 985 1 5 104.310217
137 18 5 5 156.593396
2840 381 11 2 229.538828
6579 883 11 1 171.380125
1776 235 4 7 217.734377
2691 361 1 2 148.865341
815 110 15 4 233.309491
2932 393 17 5 187.281724
I want to create dummy variables for Industry X years_spend which creates len(df.Industry.value_counts()) * len(df.years_spend.value_counts())
varaible, for example d_11_4 = 1 for all rows that has industry==1 and years spend=4 otherwise d_11_4 = 0. Then I can use these vars for some regression works.
I know I can make groups like what I want using df.groupby(['Industry','years_spend']) and I know I can create such variable for one column using patsy
syntax in statsmodels
:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry)", data=df).fit()
but If I want to do with 2 columns I get an error that:
IndexError: tuple index out of range
How can I do that with pandas or using some function inside statsmodels?
Using patsy syntax it's just:
The
:
character means "interaction"; you can also generalize this to interactions of more than two items (C(a):C(b):C(c)
), interactions between numerical and categorical values, etc. You might find the patsy docs useful.You could do something like this where you have to first create a calculated field that encapsulates the
Industry
andyears_spend
:Here's what the
df
looks like:Now you can apply
get_dummies
:That'll get you what you want :)