I am working on a data cleansing project, and I have to clean multiple fields of a pandas DataFrame as part of it. Mostly I am writing regular expressions and simple functions. Examples below:
def func1(s):
    s = str(s)
    s = s.replace(' ', '')
    if len(s) > 0 and s != '0':
        if s.isalpha() and len(s) < 2:
            return s
from string import whitespace

def func2(s):
    s = str(s)
    s = s.replace(' ', '')
    s = s.strip(whitespace + ',' + '-' + '/' + '\\')
    if s != '0':
        if s.isalnum() or s.isdigit():
            return s
def func3(s):
    s = str(s)
    if s.isdigit() and s != '0':
        return s
    else:
        return None
def func4(s):
    if str(s['j']).isalpha() and str(s['k']).isdigit() and s['l'] is None:
        return s['k']
And I am calling them like this:
x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['d'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
x['g'] = x.apply(lambda x: func4(x), axis=1)
Everything works fine, however I have written nearly 50 such functions, and my dataset has more than 10 million records, so the script runs for hours. If my understanding is correct, the functions are called row-wise, so each function is called as many times as there are rows, and that is why it takes so long to process. Is there a way to optimise this? How can I approach this in a better way? Maybe not through the apply function? Thanks.
Sample dataset:

                                  Name    f  j     b
339043                   Moir Point RD    3  0   NaN
21880   Fisher-Point Drive Freemans Ba    6  0   NaN
457170               Whakamoenga Point   29  0   NaN
318399             Motukaraka Point RD    0  0   NaN
274047    Apirana Avenue Point England  360  0   366
207588            Hobsonville Point RD  127  0   NaN
747136                    Dog Point RD  130  0   NaN
325704        Aroha Road Te Arai Point   36  0   NaN
291888               One Tree Point RD  960  0   NaN
207954            Hobsonville Point RD  160  0  205D
248410       Huia Road Point Chevalier  106  0   NaN
In general, you should avoid calling .apply on a DataFrame. This is really what is getting you. Under the hood, it creates a new Series for each row in the DataFrame and sends that to the function passed to .apply. Needless to say, this is quite a lot of overhead per row, and thus .apply on a full DataFrame is slow.
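You can see that per-row Series construction directly. A minimal sketch (the toy DataFrame here is just for illustration):

import pandas as pd

df = pd.DataFrame({'f': [3, 6, 29], 'j': [0, 0, 0]})

# axis=1 hands the function a freshly built Series for every row;
# constructing those Series is the per-row overhead described above.
df.apply(lambda row: print(type(row)), axis=1)
# prints <class 'pandas.core.series.Series'> once per row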
In the example below, I have renamed some of the columns in the function calls, since the example data was limited.
import sys
import time
import contextlib

import pandas as pd

@contextlib.contextmanager
def timethis(label):
    '''A context manager to time a bit of code.'''
    print('Timing', label, end=': ')
    sys.stdout.flush()
    start = time.time()
    yield
    print('{:.4g} seconds'.format(time.time() - start))
... func1, func2, and func3 definitions...
def func4(s):
    if str(s['j']).isalpha() and str(s['f']).isdigit() and s['b'] is None:
        return s['f']
x = pd.DataFrame({'f': [3, 6, 29, 0, 360, 127, 130, 36, 960, 160, 106],
                  'j': 0,
                  'b': [None, None, None, None, 366, None, None, None, None, '205D', None]})
x = pd.concat(x for _ in range(100000))
y = x.copy()

x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['j'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
with timethis('func4'):
    x['g'] = x.apply(func4, axis=1)  # The lambda in your example was not needed
...
def vectorized_func4(df):
    '''Accept the whole DataFrame and not just a single row.'''
    j_isalpha = df['j'].astype(str).str.isalpha()
    f_isdigit = df['f'].astype(str).str.isdigit()
    b_None = df['b'].isnull()
    ret_col = df['f'].copy()
    keep_rows = j_isalpha & f_isdigit & b_None
    ret_col[~keep_rows] = None
    return ret_col
y['a'] = vectorized_func1(y['b'])
y['c'] = vectorized_func2(y['j'])
y['e'] = vectorized_func3(y['f'])
with timethis('vectorized_func4'):
    y['g'] = vectorized_func4(y)
The output:
Timing func4: 115.9 seconds
Timing vectorized_func4: 25.09 seconds
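(The definitions of vectorized_func1 and vectorized_func2 are elided above, like func1 through func3. As a sketch of what they might look like, here is one possible version that mirrors the row-wise logic of func1 and func2; this is my reading, not necessarily the original definitions:)

from string import whitespace

def vectorized_func1(col):
    # Sketch: mirror func1 (keep single alphabetic characters, not '0').
    # .str.isalpha() is False for empty strings, which covers len(s) > 0.
    s = col.astype(str).str.replace(' ', '', regex=False)
    keep = s.str.isalpha() & (s.str.len() < 2) & (s != '0') & col.notnull()
    return s.where(keep, None)

def vectorized_func2(col):
    # Sketch: mirror func2 (strip junk characters, keep alphanumerics, not '0').
    # isdigit implies isalnum, so one check suffices.
    s = (col.astype(str)
            .str.replace(' ', '', regex=False)
            .str.strip(whitespace + ',-/\\'))
    keep = (s != '0') & s.str.isalnum() & col.notnull()
    return s.where(keep, None)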
It turns out that for func1, func2, and func3 it is a wash, performance-wise, when compared to the vectorized methods. .apply (and .map for that matter) on a Series isn't so slow, because there is no extra overhead per element. However, this does not mean that you should just use .apply when you have a Series and not investigate the vectorized built-in methods of the Series; more often than not you are likely to be able to do better than apply.
Here's how you might rewrite func3 to be vectorized (I added timing statements so we could see what takes the most time).
def vectorized_func3(col):
    with timethis('fillna'):
        col = col.fillna('')
    with timethis('astype'):
        col = col.astype(str)
    with timethis('rest'):
        is_digit_string = col.str.isdigit()
        not_0_string = col != '0'
        keep_rows = is_digit_string & not_0_string
        col[~keep_rows] = None
    return col
Here is the timing compared to func3:
Timing func3: 8.302 seconds
Timing fillna: 0.006584 seconds
Timing astype: 9.445 seconds
Timing rest: 1.65 seconds
It takes a long time just to change the dtype of a Series, since a new Series must be created and then each element gets cast. Everything else is blazing fast. If you could change your algorithm so it does not require changing the datatype to str, or could simply store the data as str in the first place, then the vectorized method would be much faster (especially vectorized_func4).
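To act on that last point, if the data comes from a CSV you could load the relevant columns as str once, so no cleaning function has to convert dtype itself. A sketch, where 'data.csv' and the column list are placeholder assumptions:

import pandas as pd

# Hypothetical file/columns: read the columns to be cleaned as str once,
# so the cleaning functions can drop their fillna/astype(str) steps entirely.
x = pd.read_csv('data.csv', dtype={'b': str, 'f': str, 'j': str})

def vectorized_func3_str(col):
    # func3 logic when the column is already str: no astype needed.
    keep = col.str.isdigit().fillna(False).astype(bool) & (col != '0')
    return col.where(keep, None)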
Takeaway

- Don't use .apply on a full DataFrame unless you absolutely must. If you think you must, go get a drink of coffee and think about it for ten minutes, and try to think of a way to do it without .apply.
- Try not to use .apply on a Series; you can probably do better, but it won't be as bad as on a full DataFrame.
- Try to come up with an algorithm that does not require constantly converting dtype.
Instead of multiple functions, you could use if..elif in one function with all the conditions, as sketched below. Just a thought!
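For example, a single function along these lines (a sketch; the 'kind' labels are made up for illustration):

def clean_value(s, kind):
    # One consolidated cleaner with if..elif; 'kind' selects the rule.
    s = str(s).replace(' ', '')
    if kind == 'alpha':        # func1-style rule
        if s.isalpha() and len(s) < 2 and s != '0':
            return s
    elif kind == 'alnum':      # func2-style rule
        if s.isalnum() and s != '0':
            return s
    elif kind == 'digit':      # func3-style rule
        if s.isdigit() and s != '0':
            return s
    return None

Note that this only consolidates the code; it is still called once per row, so the vectorization advice above still applies.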