Assume I have the data frame below:
Date, A
2014-11-21 11:00:00, 1
2014-11-21 11:03:00, 4
2014-11-21 11:04:00, 1
2014-11-21 11:05:00, 2
2014-11-21 11:07:00, 4
2014-11-21 11:08:00, 1
2014-11-21 11:12:00, 1
2014-11-21 11:13:00, 2
The first column is a datetime object and the second column is an integer. What I want is to calculate the sum of column 'A' over the last five minutes for each row.
For example, for the row 2014-11-21 11:12:00, 1, the sum of column 'A' would be 2 (1+1), and for the row 2014-11-21 11:05:00, 2, it would be 7 (2+1+4). The important thing is that the number of past rows falling inside the time window (5 minutes) is not the same for each row, because the time series is irregular.
How can I get the sum of column 'A' over the last five minutes using the rolling_sum method in pandas? Thanks in advance.
In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or to use df.apply (which, under the hood, also uses a Python loop). However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: expand the time series according to the common frequency -- in this case, 1 minute -- fill in the NaNs with zeros, and then call rolling_sum:
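The code for this step is not preserved in this copy, so here is a minimal sketch of the idea. It assumes the sample frame from the question and a hypothetical result column name last_5min_sum; pd.rolling_sum was the spelling at the time of the question (it was removed in pandas 0.18, where the equivalent is .rolling(...).sum()):

    import pandas as pd

    df = pd.DataFrame({
        'Date': pd.to_datetime([
            '2014-11-21 11:00:00', '2014-11-21 11:03:00',
            '2014-11-21 11:04:00', '2014-11-21 11:05:00',
            '2014-11-21 11:07:00', '2014-11-21 11:08:00',
            '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
        'A': [1, 4, 1, 2, 4, 1, 1, 2]})

    # Expand to a regular 1-minute grid; minutes with no observation
    # become NaN, which we then treat as zero.
    expanded = df.set_index('Date').asfreq('1min').fillna(0)

    # On a whole-minute grid a 5-row window covers minutes t-4 .. t,
    # i.e. the half-open interval (t-5min, t] the question describes.
    # Legacy API: sums = pd.rolling_sum(expanded['A'], window=5, min_periods=1)
    sums = expanded['A'].rolling(window=5, min_periods=1).sum()

    # Keep only the original, irregular timestamps.
    df['last_5min_sum'] = sums.loc[df['Date']].values
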
Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.

Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.
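The original code is likewise missing here; the following sketch makes the same assumptions (the sample frame built above, dates sorted ascending), and the helper name last_5min_sum is mine:

    import pandas as pd

    def last_5min_sum(row):
        # Binary-search the sorted Date column for the bounds of the
        # half-open window (Date - 5min, Date].
        start = df['Date'].searchsorted(
            row['Date'] - pd.Timedelta(minutes=5), side='right')
        end = df['Date'].searchsorted(row['Date'], side='right')
        return df['A'].iloc[int(start):int(end)].sum()

    df['last_5min_sum'] = df.apply(last_5min_sum, axis=1)
    print(df)

which yields (exact formatting varies slightly across pandas versions):

                     Date  A  last_5min_sum
    0 2014-11-21 11:00:00  1              1
    1 2014-11-21 11:03:00  4              5
    2 2014-11-21 11:04:00  1              6
    3 2014-11-21 11:05:00  2              7
    4 2014-11-21 11:07:00  4             11
    5 2014-11-21 11:08:00  1              8
    6 2014-11-21 11:12:00  1              2
    7 2014-11-21 11:13:00  2              3
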
Here is a benchmark comparing the df.asfreq trick with calling df.apply:
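The benchmark code did not survive in this copy, so below is a sketch of how such a comparison could look; the data size N, the random setup, and the function names are arbitrary illustrative choices, not from the original:

    import timeit

    import numpy as np
    import pandas as pd

    # A larger irregular series for timing: N unique, sorted, whole-minute
    # timestamps drawn from a range ten times as long.
    N = 2000
    rng = np.random.RandomState(0)
    offsets = np.sort(rng.choice(np.arange(10 * N), size=N, replace=False))
    df = pd.DataFrame({
        'Date': pd.Timestamp('2014-11-21') + pd.to_timedelta(offsets, unit='min'),
        'A': rng.randint(0, 10, size=N)})

    def using_asfreq():
        expanded = df.set_index('Date').asfreq('1min').fillna(0)
        sums = expanded['A'].rolling(window=5, min_periods=1).sum()
        return sums.loc[df['Date']].values

    def using_apply():
        def last_5min_sum(row):
            start = df['Date'].searchsorted(
                row['Date'] - pd.Timedelta(minutes=5), side='right')
            end = df['Date'].searchsorted(row['Date'], side='right')
            return df['A'].iloc[int(start):int(end)].sum()
        return df.apply(last_5min_sum, axis=1).values

    # Sanity check: both approaches must agree before we time them.
    assert (using_asfreq() == using_apply()).all()

    print('asfreq:', timeit.timeit(using_asfreq, number=10))
    print('apply: ', timeit.timeit(using_apply, number=10))

As an aside, pandas 0.19 and later support time-based windows directly, so on current versions df.rolling('5min', on='Date')['A'].sum() handles the irregular spacing without either workaround.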