What is the most efficient way to loop through dataframes with pandas?

Published 2018-12-31 21:22

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example, I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

I then do the following:

#!/usr/bin/env python
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date', parse_dates=True)

for i, row in enumerate(df.values):
    date = df.index[i]
    open_, high, low, close, volume, adjclose = row  # open_ avoids shadowing the built-in open
    # now perform analysis on open/close based on date, etc..

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator, to be memory efficient). df.iteritems unfortunately only iterates column by column.

10 Answers
余生无你 · #2 · 2018-12-31 21:50

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():
    # do some logic here

Or, if you want it faster, use itertuples().

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
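For reference, a vectorized version of the open/close computation from the question might look like the sketch below (column names are assumed to match the CSV above; 'Change' is an illustrative result column, not part of the original data):

import pandas as pd

df = pd.read_csv('table.csv', index_col='Date', parse_dates=True)

# One column-wide subtraction replaces the entire Python-level loop
df['Change'] = df['Close'] - df['Open']
big_moves = df[df['Change'].abs() > 0.25]  # e.g. keep only days with larger moves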

临风纵饮 · #3 · 2018-12-31 21:53

You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
    # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

# in a .pyx file; the cimport pulls in the typed ndarray buffer support
from numpy cimport ndarray, float64_t

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i in range(n):  # compiles to a plain C loop since i is typed
        foo = close[i] - open[i]  # will be extremely fast
I would recommend writing the algorithm in pure Python first, making sure it works and seeing how fast it is; if it's not fast enough, convert it to Cython like this with minimal work to get something that is about as fast as hand-coded C/C++.
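If you do try Cython, one low-friction way to compile a .pyx file is pyximport, which builds the extension on first import (a sketch; the module name my_algo.pyx is hypothetical):

import numpy as np
import pyximport

# Compile .pyx modules on import; numpy headers are needed for the ndarray buffers
pyximport.install(setup_args={'include_dirs': np.get_include()})

import my_algo  # builds and loads my_algo.pyx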

低头抚发 · #4 · 2018-12-31 21:54

As @joris pointed out, iterrows is much slower than itertuples: itertuples is approximately 100 times faster. I tested the speed of both methods on a DataFrame with 5027505 records; the result was about 1200 it/s for iterrows and 120000 it/s for itertuples.
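Your numbers will of course vary with pandas version, row width, and hardware, but a simple harness along these lines shows the gap (synthetic data; a sketch rather than the original benchmark):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'))

start = time.perf_counter()
for _, row in df.iterrows():    # each row is materialized as a Series
    _ = row['A']
print('iterrows:   %.2fs' % (time.perf_counter() - start))

start = time.perf_counter()
for row in df.itertuples():     # each row is a lightweight namedtuple
    _ = row.A
print('itertuples: %.2fs' % (time.perf_counter() - start))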

If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column you can refer to the following example code:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                      index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2
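If you also need the index label, it is exposed as the first field of the namedtuple, named Index by default:

>>> for row in df.itertuples():
...     print(row.Index, row.col1)
...
a 1
b 2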
呛了眼睛熬了心 · #5 · 2018-12-31 21:55

I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1, ...) tuples.

There's also iterkv (an older alias of iteritems), which iterates through (column, series) tuples.
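To see both orientations side by side, here is a minimal sketch reusing the small DataFrame from the answer above:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

for index, row in df.iterrows():    # row-wise: (index label, Series)
    print(index, row['col1'], row['col2'])

for name, col in df.iteritems():    # column-wise: (column label, Series)
    print(name, list(col))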
