Question:
I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *

df = read_csv('table.csv', index_col='Date')
for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc.
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
Answer 1:
The newest versions of pandas now include a built-in function for iterating over rows.
for index, row in df.iterrows():
    # do some logic here
Or, if you want it faster, use itertuples().
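For illustration, a minimal itertuples() sketch against the question's price data (assuming a pandas version where itertuples() yields namedtuples and the Date column has been set as the index):
for row in df.itertuples():
    date = row.Index               # the index (here, the date) comes first
    spread = row.High - row.Low    # columns are available as attributes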
But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
Answer 2:
Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close is a 1-d array, and you want the day-over-day percent change,
pct_change = close[1:]/close[:-1]
This computes the entire array of percent changes as one statement, instead of
pct_change = []
for row in close:
    pct_change.append(...)
So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
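As a concrete sketch of that idea on the question's data (a minimal example; the column names come from the sample CSV, and Series.pct_change() is the related pandas convenience):
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date', parse_dates=True)
df = df.sort_index()                      # the sample CSV is newest-first
close = df['Close'].values                # 1-d NumPy array of closing prices
ratio = close[1:] / close[:-1]            # one whole-array operation, no Python loop
daily_change = df['Close'].pct_change()   # fractional day-over-day change (the ratio minus 1)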
Answer 3:
You can loop through the rows by transposing and then calling iteritems:
for date, row in df.T.iteritems():
    # do some logic here
I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:
def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)
    for i from 0 <= i < n:
        foo = close[i] - open[i]  # will be extremely fast
I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is; if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.
Answer 4:
As has been mentioned before, pandas objects are most efficient when you process the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, here are at least three ways to do it. I did a short test to see which of the three is the least time-consuming.
import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)
Result:
[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]
This is probably not the best way to measure the time consumption but it's quick for me.
Here are some pros and cons IMHO:
- .iterrows(): returns the index and row items in separate variables, but is significantly slower
- .itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
- zip: fastest, but gives no access to the row's index
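(Not part of the original answer, but if you do need the index with the zip approach, you can simply include it in the zip; a small sketch on the same test frame:)
for idx, a, b in zip(t.index, t['a'], t['b']):
    C.append((idx, a, b))   # index plus both column values, still without iterrows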
Answer 5:
I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.
There's also iterkv, which iterates through (column, series) tuples.
Answer 6:
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
df['b'] = df['a'].apply(lambda x: x * 2)  # e.g. do stuff with each value of column 'a' here
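(As a side note not in the original answer: apply can also be used row-wise with axis=1, which is closer to the row iteration the question asks about; a minimal sketch, assuming the OHLC columns from the question's CSV:)
df['range'] = df.apply(lambda row: row['High'] - row['Low'], axis=1)  # one value per row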
Answer 7:
You have three options:
By index (simplest):
>>> for index in df.index:
...     print("df[" + str(index) + "]['B']=" + str(df['B'][index]))
With iterrows (most used):
>>> for index, row in df.iterrows():
...     print("df[" + str(index) + "]['B']=" + str(row['B']))
With itertuples (fastest):
>>> for row in df.itertuples():
...     print("df[" + str(row.Index) + "]['B']=" + str(row.B))
All three options display something like:
df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12
Source: neural-networks.io
Answer 8:
As @joris pointed out, iterrows is much slower than itertuples; itertuples is approximately 100 times faster. I tested the speed of both methods on a DataFrame with 5,027,505 records: iterrows ran at about 1,200 it/s, while itertuples ran at about 120,000 it/s.
If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code:
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                    index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2
Answer 9:
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray, either via df.values (as you do) or by accessing each column separately, e.g. df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.
index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interest_k = df.column_name_k.values

for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interest_k[i]
Not pythonic? Sure. But fast.
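(A concrete, minimal version of that pattern using the column names from the question's sample CSV, not from this answer:)
import pandas as pd

df = pd.read_csv('table.csv', index_col='Date')
dates = df.index.values
opens = df['Open'].values
closes = df['Close'].values

for i in range(df.shape[0]):
    change = closes[i] - opens[i]   # per-row work on plain NumPy scalars
    # ... use dates[i] and change for whatever analysis you need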
If you want to squeeze more juice out of the loop, look into Cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance, check out memory views for Cython.
Answer 10:
Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows share characteristics that allow you to do so.
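(A minimal sketch of that idea, grouping the question's daily rows by month; the 'month' column is invented here purely for illustration:)
import pandas as pd

df = pd.read_csv('table.csv', parse_dates=['Date'])
df['month'] = df['Date'].dt.to_period('M')                            # shared characteristic to group on
monthly = df.groupby('month')['Close'].agg(['mean', 'max', 'min'])    # vectorized per group, no row loop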