What to replace loops and nested if statements with

Posted 2019-09-17 11:08

Question:

How can I avoid for loops and nested if statements and be more Pythonic?

At first glance this may seem like a "please do all of my work for me" question. I can assure you that it is not. I'm trying to learn some real Python, and would like to discover ways of speeding up code based on a reproducible example and a predefined function.

I'm calculating returns from following certain signals in financial markets using loads of for loops and nested if statements. I have made several attempts, but I am just getting nowhere with vectorizing or comprehensions or other more Pythonic tools of the trade. I've been OK with that so far, but finally I'm starting to feel the pain of using functions that are simply too slow at scale.

I have a data frame with two indicators and one particular event. The first two code snippets are included to show the procedure step by step. I've included the complete thing with some predefined settings and a function at the very end.

In[ 1 ]

# Settings
import numpy as np
import pandas as pd
import datetime
np.random.seed(12345678)

Observations = 10

# Data frame values:
# Two indicators with integer values between 0 and 9
# and one Event which does or does not occur with values 0 or 1
df = pd.DataFrame(np.random.randint(0,10,size=(Observations, 2)),
                  columns=['IndicatorA', 'IndicatorB'] )
df['Event'] = np.random.randint(0,2,size=(Observations, 1))

# Data frame index:
# pd.datetime was removed in pandas 1.0; use the datetime module instead
datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])    

# Placeholder for signals based on the existing values
# in the data frame
df['Signal'] = 0

print(df)

Out[ 1 ]

The data frame is indexed by dates. The signal I'm looking for is determined by the interaction of these indicators and events. The Signal is calculated the following way (expanding on the snippet above):

In[ 2 ]

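# .ix was removed in pandas 1.0: read by position with .iloc,
# write through the index label with .loc
i = 0
for signals in df['Signal']:
    if i == 0:
        # First signal is always zero
        df.loc[df.index[i], 'Signal'] = 0
    else:
        # Signal is 1 if Indicator A is above a certain level
        if df['IndicatorA'].iloc[i] > 5:
            df.loc[df.index[i], 'Signal'] = 1
        else:
            # Signal is 1 if the previous Indicator B is above a certain level
            # AND a certain event occurs.
            # NOTE: Event only takes the values 0 or 1, so `> 1` never holds as written.
            if (df['IndicatorB'].iloc[i - 1] > 5) and (df['Event'].iloc[i] > 1):
                df.loc[df.index[i], 'Signal'] = 1
            else:
                df.loc[df.index[i], 'Signal'] = 0
    i = i + 1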

print(df['Signal'])

Out[ 2 ]

Below is the whole thing defined as a function. Notice that the function returns the average of the Signal instead of the Signal column itself. This way the console is not cluttered when the code is run, and we can test the efficiency of the code using %time in IPython.

# Settings
import numpy as np
import pandas as pd
import datetime

# The whole thing defined as a function

def fxSlow(Observations):

    np.random.seed(12345678)

    df = pd.DataFrame(np.random.randint(0,10,size=(Observations, 2)),
                        columns=['IndicatorA', 'IndicatorB'] )
    df['Event'] = np.random.randint(0,2,size=(Observations, 1))

    # pd.datetime was removed in pandas 1.0; use the datetime module instead
    datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'),
                periods=Observations).tolist()
    df['Signal'] = 0

    df['Dates'] = datelist
    df = df.set_index(['Dates'])

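    # .ix was removed in pandas 1.0: read by position with .iloc,
    # write through the index label with .loc
    i = 0
    for signals in df['Signal']:
        if i == 0:
            # First signal is always zero
            df.loc[df.index[i], 'Signal'] = 0
        else:
            # Signal is 1 if Indicator A is above a certain level
            if df['IndicatorA'].iloc[i] > 5:
                df.loc[df.index[i], 'Signal'] = 1
            else:
                # Signal is 1 if the previous Indicator B is above a certain level
                # AND a certain event occurs.
                # NOTE: Event only takes the values 0 or 1, so `> 1` never holds as written.
                if (df['IndicatorB'].iloc[i - 1] > 5) and (df['Event'].iloc[i] > 1):
                    df.loc[df.index[i], 'Signal'] = 1
                else:
                    df.loc[df.index[i], 'Signal'] = 0
        i = i + 1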


    return np.mean(df['Signal'])

Below you can see the results of running the function with different observations / size of the data frame:

So, how can I speed things up by being more Pythonic?

And as a bonus question, what causes the error when I increase the number of observations to 100000?

Answer 1:

Can you try something like this?

def fxSlow2(Observations):

    np.random.seed(12345678)

    df = pd.DataFrame(np.random.randint(0,10,size=(Observations, 2)),
                        columns=['IndicatorA', 'IndicatorB'] )
    df['Event'] = np.random.randint(0,2,size=(Observations, 1))

    # pd.datetime was removed in pandas 1.0; this needs `import datetime`
    datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'),
                periods=Observations).tolist()
    df['Signal'] = 0

    df['Dates'] = datelist
    df = df.set_index(['Dates'])

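    # shift(1) lines each row up with the previous row's IndicatorB,
    # matching the `i - 1` lookup in the loop version
    df['Signal'] = (np.where(df.IndicatorA > 5,
          1,
          np.where( (df.IndicatorB.shift(1) > 5) & (df.Event > 1),
                    1,
                    0)
          )
    )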

    df.loc[df.index[0],'Signal'] = 0

    return np.mean(df['Signal'])
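A quick way to gain confidence in a vectorized rewrite like this is to compare it against a plain loop on the same data. Below is a minimal sketch, not a definitive implementation. The names `make_frame`, `signal_loop`, `signal_vectorized` and the `event_level` parameter are hypothetical helpers introduced only for this illustration, and the frame uses a plain integer index instead of dates. The loop in the question reads IndicatorB from the previous row (`i - 1`), which corresponds to `shift(1)`.

```python
import numpy as np
import pandas as pd


def make_frame(observations, seed=12345678):
    # Hypothetical helper: same columns as in the question, integer index
    rng = np.random.RandomState(seed)
    df = pd.DataFrame(rng.randint(0, 10, size=(observations, 2)),
                      columns=['IndicatorA', 'IndicatorB'])
    df['Event'] = rng.randint(0, 2, size=observations)
    return df


def signal_loop(df, event_level=1):
    # Row-by-row reference implementation of the rule from the question
    a = df['IndicatorA'].to_numpy()
    b = df['IndicatorB'].to_numpy()
    e = df['Event'].to_numpy()
    out = np.zeros(len(df), dtype=int)
    for i in range(1, len(df)):  # row 0 keeps its initial 0
        if a[i] > 5:
            out[i] = 1
        elif b[i - 1] > 5 and e[i] > event_level:
            out[i] = 1
    return out


def signal_vectorized(df, event_level=1):
    # shift(1) moves IndicatorB down one row, so row i sees row i - 1;
    # the NaN this creates in row 0 compares as False
    prev_b = df['IndicatorB'].shift(1)
    mask = (df['IndicatorA'] > 5) | ((prev_b > 5) & (df['Event'] > event_level))
    out = mask.to_numpy().astype(int)  # astype returns a fresh, writable array
    out[0] = 0  # first signal is always zero
    return out


df = make_frame(1000)
# event_level=1 mirrors the `Event > 1` test from the question;
# event_level=0 makes the Event branch reachable, so the shift is exercised too
for level in (1, 0):
    assert np.array_equal(signal_loop(df, level), signal_vectorized(df, level))
```

Running the comparison with `event_level=0` as well matters here: since Event only takes the values 0 and 1, the `Event > 1` branch never fires, and a wrong shift direction would go unnoticed.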

%time fxSlow2(100)

Wall time: 10 ms

Out[208]: 0.43

%time fxSlow2(1000)

Wall time: 15 ms

Out[209]: 0.414

%time fxSlow2(10000)

Wall time: 61 ms

Out[210]: 0.4058
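As for the bonus question, the error message is not shown, so this is only a likely explanation: with nanosecond-resolution timestamps, pandas cannot represent dates after `pd.Timestamp.max` (in the year 2262), and 100000 daily periods starting from a recent date run past that limit. The sketch below assumes the post date as the start date and checks how many daily periods fit.

```python
import datetime

import pandas as pd

# Assumed start date (the post date); any recent date behaves the same way
start = datetime.date(2019, 9, 17)

# Last calendar day that a nanosecond-resolution Timestamp can represent
last_representable = pd.Timestamp.max.date()

# Number of daily periods that still fit when starting at `start`
max_periods = (last_representable - start).days + 1
print(last_representable, max_periods)

try:
    pd.date_range(start, periods=100000)
except (pd.errors.OutOfBoundsDatetime, OverflowError) as exc:
    # Expected with nanosecond resolution: the range runs past the limit
    print(type(exc).__name__)
else:
    print('no error raised by this pandas version')
```

If that is indeed the cause, using a finer frequency (for example hourly data) or a plain integer index for the benchmark avoids the limit.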