I am trying to use Python for statistical analysis.
In Stata I can define local macros and expand them as necessary:
program define reg2
syntax varlist(min=1 max=1), indepvars(string) results(string)
if "`results'" == "y" {
reg `varlist' `indepvars'
}
if "`results'" == "n" {
qui reg `varlist' `indepvars'
}
end
sysuse auto, clear
So instead of:
reg2 mpg, indepvars("weight foreign price") results("y")
I could do:
local options , indepvars(weight foreign price) results(y)
reg2 mpg `options'
Or even:
local vars weight foreign price
local options , indepvars(`vars') results(y)
reg2 mpg `options'
Macros in Stata help me write clean scripts, without repeating code.
In Python I tried string interpolation but this does not work in functions.
For example:
def reg2(depvar, indepvars, results):
print(depvar)
print(indepvars)
print(results)
The following runs fine:
reg2('mpg', 'weight foreign price', 'y')
However, both of these fail:
regargs = 'mpg', 'weight foreign price', 'y'
reg2(regargs)
regargs = 'depvar=mpg, covariates=weight foreign price, results=y'
reg2(regargs)
I found a similar question but it doesn't answer my question:
- How do I create a stata local macro in python?
There is also another question about this for R:
- How do I create a "macro" for regressors in R?
However, I could not find anything for Python specifically.
I was wondering if there is anything in Python that is similar to Stata's macros?
Do it the Pythonic way.
The pervasive use of macros in Stata reflects a different programming philosophy.
Unlike Python, which is an object-oriented general purpose programming language,
Stata's ado
language (not mata
) requires macros in order to function as
something more than a simple scripting language.
Macros can be used almost anywhere in Stata (even in macro definitions) for two purposes:
- Text substitution
- Expression evaluation
Using macros, the user can simplify their code, which in turn will reduce the
potential for errors and keep it tidy. The disadvantage is that the use of macros
renders the syntax of the language fluid.
To answer your question, Pyexpander
provides some of this kind of functionality in Python but it is not really a
substitute. For different use cases you will need a different approach to mimic
macro expansion. In contrast with Stata, there is no uniform way of doing this everywhere.
My advice is to familiarize yourself with Python's conventions rather than
trying to program things the "Stata way". For example, it is useful to remember
that local and global macros in Stata correspond to variables in Python (local
in a function, global outside), while variables in Stata correspond to
Pandas.Series
or a column of a Pandas.DataFrame
. Similarly, Stata ado
programs correspond to functions in Python.
The solution provided in @g.d.d.c's answer can be a good tool towards achieving
what someone would like. However, extra steps are required here if you want to
re-use your code.
Using your toy example:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
In [1]: df[['mpg', 'weight', 'price']].head()
Out[1]:
mpg weight price
0 22 2930 4099
1 17 3350 4749
2 22 2640 3799
3 20 3250 4816
4 15 4080 7827
Let's assume you want to re-use the following snippet of code but with different
variables:
In [2]: Y = df['mpg']
In [3]: df['cons'] = 1
In [4]: X = df[['weight', 'price', 'cons']]
In [5]: reg = sm.OLS(Y, X).fit()
In [6]: print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.653
Model: OLS Adj. R-squared: 0.643
Method: Least Squares F-statistic: 66.85
Date: Prob (F-statistic): 4.73e-17
Time: Log-Likelihood: -195.22
No. Observations: 74 AIC: 396.4
Df Residuals: 71 BIC: 403.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
weight -0.0058 0.001 -9.421 0.000 -0.007 -0.005
price -9.351e-05 0.000 -0.575 0.567 -0.000 0.000
cons 19.7198 0.811 24.322 0.000 18.103 21.336
==============================================================================
Omnibus: 29.900 Durbin-Watson: 2.347
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.190
Skew: 1.422 Prob(JB): 8.51e-14
Kurtosis: 6.382 Cond. No. 1.50e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.5e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
How could you possibly do that?
First, create a function:
def reg2(depvar, indepvars, results, df):
Y = df[depvar]
df['cons'] = 1
X = df[indepvars]
reg = sm.OLS(Y, X).fit()
if results != 0:
print(reg.summary())
However, note that although string interpolation can 'expand' strings, here
this approach will not work because the target function for regresson analysis
does not accept a unified string of the kind 'weight, price, cons'
.
Instead you need to define a list with the regressors:
predictors = ['weight', 'price', 'cons']
reg2('mpg', predictors, 0, df)
You can also take this concept to the next level by constructing a decorator:
def load_and_reg2(func):
def wrapper(*args, **kwargs):
print()
print("Loading the dataset...")
print()
df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
sumvars = df[['mpg', 'weight', 'price']].head()
print(sumvars)
print()
func(*args, **kwargs, df = df)
return func(*args, **kwargs, df = df)
print()
print("Doing any other stuff you like...")
print()
dfshape = df.shape
print('Shape:', dfshape)
return wrapper
And use this in your reg2()
function:
@load_and_reg2
def reg2(depvar, indepvars, results, df):
Y = df[depvar]
df['cons'] = 1
X = df[indepvars]
reg = sm.OLS(Y, X).fit()
if results != 0:
print(reg.summary())
return reg
The example is perhaps very simplistic but demonstrates the power of Python:
In [7]: [predictors = ['weight', 'price', 'cons']
In [8]: reg2('mpg', predictors, 1)
Loading the dataset...
mpg weight price
0 22 2930 4099
1 17 3350 4749
2 22 2640 3799
3 20 3250 4816
4 15 4080 7827
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.653
Model: OLS Adj. R-squared: 0.643
Method: Least Squares F-statistic: 66.85
Date: Prob (F-statistic): 4.73e-17
Time: Log-Likelihood: -195.22
No. Observations: 74 AIC: 396.4
Df Residuals: 71 BIC: 403.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
weight -0.0058 0.001 -9.421 0.000 -0.007 -0.005
price -9.351e-05 0.000 -0.575 0.567 -0.000 0.000
cons 39.4397 1.622 24.322 0.000 36.206 42.673
==============================================================================
Omnibus: 29.900 Durbin-Watson: 2.347
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.190
Skew: 1.422 Prob(JB): 8.51e-14
Kurtosis: 6.382 Cond. No. 3.00e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Doing any other stuff you like...
Shape: (74, 13)
As you can see, the decorator further abstracts things but using fixed syntax.
In the Python universe dictionaries and classes also play important roles in
re-using code/results. For example, a dictionary can act as the equivalent of
Stata's return
space for storing multiple macros, scalars etc.
Consider the slightly altered version of our toy decorator load_and_reg2
, which
now saves individual objects in a dictionary D
and returns it:
def load_and_reg2(func):
def wrapper(*args, **kwargs):
D = {}
print()
print("Loading the dataset...")
print()
df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
sumvars = df[['mpg', 'weight', 'price']].head()
D['sumvars'] = sumvars
print(sumvars)
print()
D['reg2'] = func(*args, **kwargs, df)
print()
print("Doing any other stuff you like...")
print()
dfshape = df.shape
D['dfshape'] = dfshape
print('Shape:', dfshape)
return D
return wrapper
You can then easily do:
In [9]: foo = reg2('mpg', predictors, 1)
In [10]: foo.keys()
Out[10]: dict_keys(['sumvars', 'reg2', 'dfshape'])
In [11]: foo['sumvars']
Out[11]:
mpg weight price
0 22 2930 4099
1 17 3350 4749
2 22 2640 3799
3 20 3250 4816
4 15 4080 7827
Classes can introduce further flexibility at the cost
of some additional complexity:
class loadreg2return(object):
def __init__(self, sumvars=None, reg2=None, dfshape=None):
self.sumvars = sumvars
self.reg2 = reg2
self.dfshape = dfshape
def load_and_reg2(func):
def wrapper(*args, **kwargs):
print("Loading the dataset...")
print()
df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
sumvars = df[['mpg', 'weight', 'price']].head()
print(sumvars)
print()
reg2 = func(*args, **kwargs, df = df)
print()
print("Doing any other stuff you like...")
print()
dfshape = df.shape
loadreg2return(dfshape = dfshape)
print('Shape:', dfshape)
return loadreg2return(sumvars = sumvars, reg2 = reg2, dfshape = dfshape )
return wrapper
This version of our toy decorator returns:
In [12]: foo.dfshape
Out[12]: (74, 13)
In [13]: foo.sumvars
Out[13]:
mpg weight price
0 22 2930 4099
1 17 3350 4749
2 22 2640 3799
3 20 3250 4816
4 15 4080 7827
In [14]: foo.reg2.params
Out[14]:
weight -0.005818
price -0.000094
cons 39.439656
dtype: float64
It looks like you just want the *
and **
operators for calling functions:
regargs = 'mpg', 'weight foreign price', 'y'
reg2(*regargs)
Use *
to expand a list or tuple into positional arguments, or use **
to expand a dictionary into keyword arguments to a function that requires them.
For your keyword example, you need to change the declaration a little bit:
regargs = dict(depvar='mpg', covariates='weight foreign price', results='y')
reg2(**regargs)