I am working on a data cleansing project in which I have to clean multiple fields of a pandas DataFrame. Mostly I am writing regular expressions and simple functions. Examples below:
from string import whitespace  # needed by func2's strip()

def func1(s):
    s = str(s)
    s = s.replace(' ', '')
    if len(s) > 0 and s != '0':
        if s.isalpha() and len(s) < 2:
            return s

def func2(s):
    s = str(s)
    s = s.replace(' ', '')
    s = s.strip(whitespace + ',' + '-' + '/' + '\\')
    if s != '0':
        if s.isalnum() or s.isdigit():
            return s

def func3(s):
    s = str(s)
    if s.isdigit() and s != '0':
        return s
    else:
        return None

def func4(s):
    if str(s['j']).isalpha() and str(s['k']).isdigit() and s['l'] is None:
        return s['k']
And I am calling them like this:
x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['d'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
x['g'] = x.apply(lambda x: func4(x), axis = 1)
Everything works fine, but I have written nearly 50 such functions, and my dataset has more than 10 million records, so the script runs for hours. If my understanding is correct, the functions are called row-wise, so each function is called as many times as there are rows, and that is why it takes so long to process. Is there a way to optimise this? How can I approach this in a better way, perhaps not through the apply function? Thanks.
Sample dataset:

        Name                            f    j  b
339043  Moir Point RD                   3    0  NaN
21880   Fisher-Point Drive Freemans Ba  6    0  NaN
457170  Whakamoenga Point               29   0  NaN
318399  Motukaraka Point RD             0    0  NaN
274047  Apirana Avenue Point England    360  0  366
207588  Hobsonville Point RD            127  0  NaN
747136  Dog Point RD                    130  0  NaN
325704  Aroha Road Te Arai Point        36   0  NaN
291888  One Tree Point RD               960  0  NaN
207954  Hobsonville Point RD            160  0  205D
248410  Huia Road Point Chevalier       106  0  NaN
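For reference in the timing examples below, the sample can be loaded into a DataFrame x with something like this (NaN stands in for the blanks in b):

import numpy as np
import pandas as pd

# Sample rows from above; NaN marks the blank entries in 'b'.
x = pd.DataFrame(
    {'Name': ['Moir Point RD', 'Fisher-Point Drive Freemans Ba',
              'Whakamoenga Point', 'Motukaraka Point RD',
              'Apirana Avenue Point England', 'Hobsonville Point RD',
              'Dog Point RD', 'Aroha Road Te Arai Point',
              'One Tree Point RD', 'Hobsonville Point RD',
              'Huia Road Point Chevalier'],
     'f': [3, 6, 29, 0, 360, 127, 130, 36, 960, 160, 106],
     'j': [0] * 11,
     'b': [np.nan, np.nan, np.nan, np.nan, '366', np.nan,
           np.nan, np.nan, np.nan, '205D', np.nan]},
    index=[339043, 21880, 457170, 318399, 274047, 207588,
           747136, 325704, 291888, 207954, 248410])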
In general, you should avoid calling .apply on a DataFrame. This is really what is getting you. Under the hood, it creates a new Series for each row in the DataFrame and sends that to the function passed to .apply. Needless to say, this is quite a lot of overhead per row, and thus .apply on a full DataFrame is slow.

In the below example, I have renamed some of the columns in the function calls, since the example data was limited.
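A sketch of such a comparison, assuming the sample DataFrame x from above; 'Name', 'f', and 'b' stand in for the original column names, and func4_renamed is a hypothetical variant of func4 adapted to columns that exist in the sample:

import timeit

# func4 adapted to the sample columns: 'Name' for 'j', 'f' for 'k', 'b' for 'l'.
# pd.isnull is used instead of `is None` because missing values are NaN here.
def func4_renamed(row):
    if str(row['Name']).isalpha() and str(row['f']).isdigit() and pd.isnull(row['b']):
        return row['f']

# Time the row-wise calls: .apply on a Series (func1-func3) vs. on the full DataFrame (func4).
t1 = timeit.timeit(lambda: x['Name'].apply(lambda v: func1(v) if pd.notnull(v) else v), number=100)
t2 = timeit.timeit(lambda: x['Name'].apply(lambda v: func2(v) if pd.notnull(v) else v), number=100)
t3 = timeit.timeit(lambda: x['f'].apply(lambda v: func3(v) if pd.notnull(v) else v), number=100)
t4 = timeit.timeit(lambda: x.apply(func4_renamed, axis=1), number=100)
print('func1: %.4fs  func2: %.4fs  func3: %.4fs  func4: %.4fs' % (t1, t2, t3, t4))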
It turns out that for func1, func2, and func3 it is a wash, performance-wise, when compared to the vectorized methods. .apply (and .map, for that matter) on a Series isn't so slow, because there is no extra overhead per element. However, this does not mean that you should just use .apply when you have a Series and not investigate the vectorized built-in methods of the Series - more often than not you are likely to be able to do better than .apply.

Here's how you might rewrite func3 to be vectorized (I added timing statements so we could see what takes the most time).
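A sketch of what that rewrite might look like (the .where call maps non-matches to NaN rather than None, which pd.notnull treats the same way):

import time

def vectorized_func3(col):
    t0 = time.time()
    s = col.astype(str)                  # cast the whole column to str in one go
    t1 = time.time()
    mask = s.str.isdigit() & (s != '0')  # element-wise, vectorized checks
    t2 = time.time()
    out = s.where(mask)                  # keep matches; everything else becomes NaN
    t3 = time.time()
    print('astype: %.5fs  mask: %.5fs  where: %.5fs' % (t1 - t0, t2 - t1, t3 - t2))
    return out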
Here is the timing compared to func3:
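The exact numbers depend on the data and machine, but the comparison can be run like this (note that vectorized_func3 will print its per-step timings on every run):

import timeit

t_apply = timeit.timeit(
    lambda: x['f'].apply(lambda v: func3(v) if pd.notnull(v) else v), number=100)
t_vec = timeit.timeit(lambda: vectorized_func3(x['f']), number=100)
print('func3 via .apply: %.4fs   vectorized_func3: %.4fs' % (t_apply, t_vec))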
It takes a long time just to change the dtype of a Series, since a new Series must be created and then each element gets cast. Everything else is blazing. If you could change your algorithm to not require changing the datatype to str, or could simply store the data as str in the first place, then the vectorized method would be much faster (especially vectorized_func4).
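A sketch of what vectorized_func4 might look like, using the question's original column names 'j', 'k', and 'l':

def vectorized_func4(df):
    j = df['j'].astype(str)   # the dtype changes are the expensive part
    k = df['k'].astype(str)
    mask = j.str.isalpha() & k.str.isdigit() & df['l'].isnull()
    return k.where(mask)      # rows failing any check become NaN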
Takeaway

- Don't use .apply on a full DataFrame unless you absolutely must. If you think you must, go get a drink of coffee and think about it for ten minutes, and try to think of a way to do it without .apply.
- If you use .apply on a Series, you can probably do better, but it won't be as bad as on a full DataFrame.
- Try to avoid changing the dtype.
Instead of multiple functions, you can use if..elif in one function with all the conditions. Just a thought!
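For example, a sketch that folds func1's and func3's checks into a single cleaning function (hypothetical; which rules you combine would depend on which field you are cleaning):

def clean(s):
    s = str(s).replace(' ', '')
    if not s or s == '0':
        return None
    elif s.isalpha() and len(s) < 2:  # func1's rule
        return s
    elif s.isdigit():                 # func3's rule
        return s
    return None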