I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *
df = read_csv('table.csv', index_col='Date')
for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc.
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close is a 1-d array and you want the day-over-day percent change:

pct_change = close[1:] / close[:-1] - 1

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for i in range(1, len(close)):
    pct_change.append(close[i] / close[i-1] - 1)
So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.

Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows share characteristics which allow you to do so.
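For instance, a minimal sketch of that groupby idea, assuming you wanted a per-month average of the Close column (the column names follow the CSV above; the monthly grouping is just an illustrative choice):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date', parse_dates=True)
# one vectorized aggregation per group instead of a Python loop over rows
# (monthly grouping is an assumed example, not from the original answer)
monthly_mean_close = df['Close'].groupby(df.index.to_period('M')).mean()
print(monthly_mean_close)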
As has been mentioned before, a pandas object is most efficient when you process the whole array at once. However, for those who, like me, really need to loop through a pandas DataFrame to perform something, I found at least three ways to do it. I have done a short test to see which of the three is the least time consuming.
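Such a comparison might look like the following sketch (an illustrative harness, assuming the MSFT DataFrame from the question and taking iterrows, itertuples, and zip over columns as the three candidates):

import time
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

# 1) iterrows: yields (index, Series) pairs, convenient but heavy
start = time.time()
for index, row in df.iterrows():
    _ = row['Open'], row['Close']
print('iterrows:   %.4f s' % (time.time() - start))

# 2) itertuples: yields lightweight namedtuples
start = time.time()
for row in df.itertuples():
    _ = row.Open, row.Close
print('itertuples: %.4f s' % (time.time() - start))

# 3) zip over the raw column values
start = time.time()
for o, c in zip(df['Open'], df['Close']):
    _ = o, c
print('zip:        %.4f s' % (time.time() - start))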
Result:
This is probably not the best way to measure the time consumption but it's quick for me.
Here are some pros and cons IMHO: iterrows is the most convenient (you get the index and a full Series per row) but by far the slowest; itertuples is much faster since it yields lightweight namedtuples, though column names that are not valid Python identifiers get renamed; zip over the column values is the fastest of the three, but you lose the index and only see the columns you ask for.
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately via df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.

Not pythonic? Sure. But fast.
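Concretely, that pattern might look like this sketch (column names follow the CSV above):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')
dates = df.index.values        # plain numpy arrays, no Series overhead
opens = df['Open'].values
closes = df['Close'].values
for i in range(df.shape[0]):
    # work directly with numpy scalars inside the loop
    date, open_, close = dates[i], opens[i], closes[i]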
If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.
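As an illustration only, a toy Cython kernel with a typed memoryview might look like this (a sketch, not from the original answer; it would live in a .pyx file and be compiled, e.g. via cythonize):

# sum_kernel.pyx -- hypothetical example
def total(double[:] values):
    # typed memoryview gives C-speed element access
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        s += values[i]
    return s

It could then be called on a column from Python, e.g. total(df['Close'].values).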
Just as a small addition: you can also use apply if you have a complex function to apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
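For example (a sketch; classify and close_band are hypothetical names):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')

def classify(close):
    # hypothetical helper: label each closing price against a threshold
    return 'high' if close > 27.0 else 'low'

# apply an arbitrary Python function to every value of a single column
df['close_band'] = df['Close'].apply(classify)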
You have three options:

By index (simplest):

for i in df.index:
    print(df['Open'][i])

With iterrows (most used):

for index, row in df.iterrows():
    print(row['Open'])

With itertuples (fastest):

for row in df.itertuples():
    print(row.Open)

All three options display something like:

27.37
26.94
27.11
27.31
...
Source: neural-networks.io