Equivalent of Stata macros in Python

2020-02-26 15:24发布

问题:

I am trying to use Python for statistical analysis.

In Stata I can define local macros and expand them as necessary:

program define reg2
    syntax varlist(min=1 max=1), indepvars(string) results(string)
    if "`results'" == "y" {
        reg `varlist' `indepvars'
    }
    if "`results'" == "n" {
        qui reg `varlist' `indepvars'
    }
end

sysuse auto, clear

So instead of:

reg2 mpg, indepvars("weight foreign price") results("y")

I could do:

local options , indepvars(weight foreign price) results(y) 
reg2 mpg `options'

Or even:

local vars weight foreign price
local options , indepvars(`vars') results(y) 
reg2 mpg `options'

Macros in Stata help me write clean scripts, without repeating code.

In Python I tried string interpolation but this does not work in functions.

For example:

def reg2(depvar, indepvars, results):
    print(depvar)
    print(indepvars)
    print(results)

The following runs fine:

reg2('mpg', 'weight foreign price', 'y')

However, both of these fail:

regargs = 'mpg', 'weight foreign price', 'y'
reg2(regargs)

regargs = 'depvar=mpg, covariates=weight foreign price, results=y'
reg2(regargs)

I found a similar question but it doesn't answer my question:

  • How do I create a stata local macro in python?

There is also another question about this for R:

  • How do I create a "macro" for regressors in R?

However, I could not find anything for Python specifically.

I was wondering if there is anything in Python that is similar to Stata's macros?

回答1:

Do it the Pythonic way.

The pervasive use of macros in Stata reflects a different programming philosophy. Unlike Python, which is an object-oriented general purpose programming language, Stata's ado language (not mata) requires macros in order to function as something more than a simple scripting language.

Macros can be used almost anywhere in Stata (even in macro definitions) for two purposes:

  1. Text substitution
  2. Expression evaluation

Using macros, the user can simplify their code, which in turn will reduce the potential for errors and keep it tidy. The disadvantage is that the use of macros renders the syntax of the language fluid.

To answer your question, Pyexpander provides some of this kind of functionality in Python but it is not really a substitute. For different use cases you will need a different approach to mimic macro expansion. In contrast with Stata, there is no uniform way of doing this everywhere.

My advice is to familiarize yourself with Python's conventions rather than trying to program things the "Stata way". For example, it is useful to remember that local and global macros in Stata correspond to variables in Python (local in a function, global outside), while variables in Stata correspond to Pandas.Series or a column of a Pandas.DataFrame. Similarly, Stata ado programs correspond to functions in Python.

The solution provided in @g.d.d.c's answer can be a good tool towards achieving what someone would like. However, extra steps are required here if you want to re-use your code.

Using your toy example:

import pandas as pd   
import numpy as np      
import statsmodels.api as sm

df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')

In [1]: df[['mpg', 'weight', 'price']].head()
Out[1]: 
   mpg  weight  price
0   22    2930   4099
1   17    3350   4749
2   22    2640   3799
3   20    3250   4816
4   15    4080   7827

Let's assume you want to re-use the following snippet of code but with different variables:

In [2]: Y = df['mpg']
In [3]: df['cons'] = 1
In [4]: X = df[['weight', 'price', 'cons']]

In [5]: reg = sm.OLS(Y, X).fit()
In [6]: print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.653
Model:                            OLS   Adj. R-squared:                  0.643
Method:                 Least Squares   F-statistic:                     66.85
Date:                                   Prob (F-statistic):           4.73e-17
Time:                                   Log-Likelihood:                -195.22
No. Observations:                  74   AIC:                             396.4
Df Residuals:                      71   BIC:                             403.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
weight        -0.0058      0.001     -9.421      0.000      -0.007      -0.005
price      -9.351e-05      0.000     -0.575      0.567      -0.000       0.000
cons          19.7198      0.811     24.322      0.000      18.103      21.336
==============================================================================
Omnibus:                       29.900   Durbin-Watson:                   2.347
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               60.190
Skew:                           1.422   Prob(JB):                     8.51e-14
Kurtosis:                       6.382   Cond. No.                     1.50e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.5e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

How could you possibly do that?

First, create a function:

def reg2(depvar, indepvars, results, df):
    Y = df[depvar]
    df['cons'] = 1
    X = df[indepvars]

    reg = sm.OLS(Y, X).fit()
    if results != 0:
        print(reg.summary())

However, note that although string interpolation can 'expand' strings, here this approach will not work because the target function for regresson analysis does not accept a unified string of the kind 'weight, price, cons'.

Instead you need to define a list with the regressors:

predictors = ['weight', 'price', 'cons']
reg2('mpg', predictors, 0, df)

You can also take this concept to the next level by constructing a decorator:

def load_and_reg2(func):
    def wrapper(*args, **kwargs):

        print()
        print("Loading the dataset...")
        print()

        df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
        sumvars = df[['mpg', 'weight', 'price']].head()
        print(sumvars)
        print()

        func(*args, **kwargs, df = df)
        return func(*args, **kwargs, df = df)

        print()
        print("Doing any other stuff you like...")
        print()

        dfshape = df.shape
        print('Shape:', dfshape)

    return wrapper

And use this in your reg2() function:

@load_and_reg2
def reg2(depvar, indepvars, results, df):
    Y = df[depvar]
    df['cons'] = 1
    X = df[indepvars]

    reg = sm.OLS(Y, X).fit()
    if results != 0:
        print(reg.summary())
    return reg

The example is perhaps very simplistic but demonstrates the power of Python:

In [7]: [predictors = ['weight', 'price', 'cons']
In [8]: reg2('mpg', predictors, 1)

Loading the dataset...

   mpg  weight  price
0   22    2930   4099
1   17    3350   4749
2   22    2640   3799
3   20    3250   4816
4   15    4080   7827

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.653
Model:                            OLS   Adj. R-squared:                  0.643
Method:                 Least Squares   F-statistic:                     66.85
Date:                                   Prob (F-statistic):           4.73e-17
Time:                                   Log-Likelihood:                -195.22
No. Observations:                  74   AIC:                             396.4
Df Residuals:                      71   BIC:                             403.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
weight        -0.0058      0.001     -9.421      0.000      -0.007      -0.005
price      -9.351e-05      0.000     -0.575      0.567      -0.000       0.000
cons          39.4397      1.622     24.322      0.000      36.206      42.673
==============================================================================
Omnibus:                       29.900   Durbin-Watson:                   2.347
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               60.190
Skew:                           1.422   Prob(JB):                     8.51e-14
Kurtosis:                       6.382   Cond. No.                     3.00e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Doing any other stuff you like...

Shape: (74, 13)

As you can see, the decorator further abstracts things but using fixed syntax.

In the Python universe dictionaries and classes also play important roles in re-using code/results. For example, a dictionary can act as the equivalent of Stata's return space for storing multiple macros, scalars etc.

Consider the slightly altered version of our toy decorator load_and_reg2, which now saves individual objects in a dictionary D and returns it:

def load_and_reg2(func):
    def wrapper(*args, **kwargs):

        D = {}

        print()   
        print("Loading the dataset...")
        print()

        df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
        sumvars = df[['mpg', 'weight', 'price']].head()
        D['sumvars'] = sumvars
        print(sumvars)
        print()

        D['reg2'] = func(*args, **kwargs, df)

        print()
        print("Doing any other stuff you like...")

        print()                     
        dfshape = df.shape
        D['dfshape'] = dfshape              
        print('Shape:', dfshape)

        return D    

    return wrapper

You can then easily do:

In [9]: foo = reg2('mpg', predictors, 1)

In [10]: foo.keys()
Out[10]: dict_keys(['sumvars', 'reg2', 'dfshape'])

In [11]: foo['sumvars']
Out[11]: 
   mpg  weight  price
0   22    2930   4099
1   17    3350   4749
2   22    2640   3799
3   20    3250   4816
4   15    4080   7827

Classes can introduce further flexibility at the cost of some additional complexity:

class loadreg2return(object):
    def __init__(self, sumvars=None, reg2=None, dfshape=None):
        self.sumvars = sumvars
        self.reg2 = reg2
        self.dfshape = dfshape

def load_and_reg2(func):
    def wrapper(*args, **kwargs):

        print("Loading the dataset...")
        print()

        df = pd.read_stata('http://www.stata-press.com/data/r14/auto.dta')
        sumvars = df[['mpg', 'weight', 'price']].head()
        print(sumvars)
        print()

        reg2 = func(*args, **kwargs, df = df)

        print()
        print("Doing any other stuff you like...")

        print()                     
        dfshape = df.shape
        loadreg2return(dfshape = dfshape)            
        print('Shape:', dfshape)

        return loadreg2return(sumvars = sumvars, reg2 = reg2, dfshape = dfshape )

    return wrapper

This version of our toy decorator returns:

In [12]: foo.dfshape
Out[12]: (74, 13)

In [13]: foo.sumvars
Out[13]: 
   mpg  weight  price
0   22    2930   4099
1   17    3350   4749
2   22    2640   3799
3   20    3250   4816
4   15    4080   7827

In [14]: foo.reg2.params
Out[14]: 
weight    -0.005818
price     -0.000094
cons      39.439656

dtype: float64


回答2:

It looks like you just want the * and ** operators for calling functions:

regargs = 'mpg', 'weight foreign price', 'y'
reg2(*regargs)

Use * to expand a list or tuple into positional arguments, or use ** to expand a dictionary into keyword arguments to a function that requires them.

For your keyword example, you need to change the declaration a little bit:

regargs = dict(depvar='mpg', covariates='weight foreign price', results='y')
reg2(**regargs)