I have noticed very poor performance when using iterrows from pandas.
Is this something that is experienced by others? Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?
This discussion on GitHub led me to believe it is caused when mixing dtypes in the dataframe, however the simple example below shows it is there even when using one dtype (float64). This takes 36 seconds on my machine:
import pandas as pd
import numpy as np
import time
s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})
start = time.time()
i = 0
for rowindex, row in dfa.iterrows():
    i += 1
end = time.time()
print(end - start)
Why are vectorized operations like apply so much quicker? I imagine there must be some row by row iteration going on there too.
I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes. Thank you!
---Edit: simplified version of what I want to run has been added below---
import pandas as pd
import numpy as np
#%% Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)
#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]
#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index)
#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    table3.loc[row_index] = optimize(t2info, row['number1'])
Yes, pandas itertuples() is faster than iterrows(). You can refer to the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
"To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows."
Another option is to use to_records(), which is faster than both itertuples and iterrows. But for your case, there is much room for other types of improvements.
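A rough sketch of the same loop over to_records(), again assuming the frame from the question:
import time
import numpy as np
import pandas as pd

dfa = pd.DataFrame({'s1': np.random.randn(2000000),
                    's2': np.random.randn(2000000)})
recs = dfa.to_records(index=False)   # a plain numpy record array
start = time.time()
total = 0.0
for r in recs:
    # fields are accessed by name, e.g. r['s1']
    total += r['s1']
end = time.time()
print(end - start)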
Here's my final optimized version. In my benchmark test, it is almost 10x faster than the original code. The strategy is to use groupby to avoid repeatedly comparing values, and to_records to access raw numpy.records objects.
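A rough sketch of that strategy applied to the question's tables (illustrative only, not necessarily the answer's original code): group table2 by letter once, then scan each group's records.
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Group table2 once instead of re-filtering it for every row of table1
grouped = {letter: grp.to_records(index=False)
           for letter, grp in table2.groupby('letter')}

rows = []
for rec in table1.to_records(index=False):
    letter, number1 = rec['letter'], rec['number1']
    recs = grouped[letter]
    # pick the record whose number2 * number1 is largest
    best = max(recs, key=lambda r: r['number2'] * number1)
    rows.append((best['letter'], best['number2']))

table3 = pd.DataFrame(rows, columns=['letter', 'number2'])
print(table3)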
Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:
1) vectorization
2) using a custom Cython routine
3) apply
4) itertuples
5) iterrows
6) updating an empty frame one row at a time
Using a custom Cython routine is usually too complicated, so let's skip that for now.
1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (mostly involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish frame, it may be faster to use other methods.
3) apply can usually be handled by an iterator in Cython space (this is done internally by pandas), but it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly (of course df.sum(1) is even better). However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and is consequently slower.
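As a rough illustration of those claims (a sketch only; axis=1 is added to the last expression to make it row-wise):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 2), columns=['a', 'b'])

start = time.time()
df.apply(lambda x: np.sum(x))           # column-wise, handled quickly
print('apply np.sum :', time.time() - start)

start = time.time()
df.sum(1)                               # dedicated method, faster still
print('df.sum(1)    :', time.time() - start)

start = time.time()
df.apply(lambda x: x['b'] + 1, axis=1)  # row-wise in Python space, much slower
print('apply python :', time.time() - start)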
4) itertuples does not box the data into a Series; it just returns the data as a tuple.
5) iterrows DOES box the data into a Series. Unless you really need this, use another method.
6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
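For instance (an illustrative sketch):
import pandas as pd

pieces = []
for i in range(1000):
    # build each piece as its own structure...
    pieces.append(pd.DataFrame({'x': [i], 'y': [i * 2]}))
# ...and concatenate once at the end, instead of growing a frame row by row
result = pd.concat(pieces, ignore_index=True)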
Here's the way to do your problem. This is all vectorized.
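A sketch of one vectorized way to do it (illustrative, not necessarily the answer's original code): merge the tables, compute the product in one shot, then take the best row per letter.
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Merge once, then compute the objective for every pairing at once
merged = table2.merge(table1, on='letter')
merged['calculation'] = merged['number2'] * merged['number1']

# For each letter, keep the row with the largest calculation
idx = merged.groupby('letter')['calculation'].idxmax()
table3 = merged.loc[idx, ['letter', 'number2']].reset_index(drop=True)
print(table3)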
Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:
Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) perform calculations in C (often via Cython). The type of the array is determined only at the start of the iteration; this savings alone is one of the biggest wins.
Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) are going to have few cache misses.
More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits to parallelism won't be as dramatic as the static typing and better caching, but they're still a solid win.
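As a quick illustration of the gap (a rough sketch, not a rigorous benchmark):
import time
import numpy as np

a = np.random.randn(2000000)

start = time.time()
total = 0.0
for x in a:            # scalar loop: per-element dispatch and boxing in Python
    total += x
print('python loop:', time.time() - start)

start = time.time()
total = a.sum()        # vectorized: one typed C loop over a contiguous buffer
print('numpy sum:', time.time() - start)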
Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)