Assume I have the data frame below:
Date, A
2014-11-21 11:00:00, 1
2014-11-21 11:03:00, 4
2014-11-21 11:04:00, 1
2014-11-21 11:05:00, 2
2014-11-21 11:07:00, 4
2014-11-21 11:08:00, 1
2014-11-21 11:12:00, 1
2014-11-21 11:13:00, 2
The first column is a datetime object and the second column is an integer. What I want is to calculate the sum of column 'A' over the last five minutes for each row.
For example, for the row 2014-11-21 11:12:00, 1, the sum of column 'A' would be 2 (1+1), and for the row 2014-11-21 11:05:00, 2, it would be 7 (2+1+4). The important thing is that the number of past rows falling inside the time window (5 minutes) is not the same for each row, because the time series is irregular.
How can I get the sum of column 'A' over the last five minutes using the rolling_sum method in pandas? Thanks in advance.
In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or to use df.apply (which, under the hood, also uses a Python loop). However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: expand the time series according to the common frequency -- in this case, 1 minute -- fill in the NaNs with zeros, and then call rolling_sum:
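The code for this step is not preserved in this copy, so here is a minimal sketch of the idea. It assumes the sample frame from the question and a hypothetical result column name last_5min_sum; pd.rolling_sum was the spelling at the time of the question (it was removed in pandas 0.18, where the equivalent is .rolling(...).sum()):

    import pandas as pd

    df = pd.DataFrame({
        'Date': pd.to_datetime([
            '2014-11-21 11:00:00', '2014-11-21 11:03:00',
            '2014-11-21 11:04:00', '2014-11-21 11:05:00',
            '2014-11-21 11:07:00', '2014-11-21 11:08:00',
            '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
        'A': [1, 4, 1, 2, 4, 1, 1, 2]})

    # Expand to a regular 1-minute grid; minutes with no observation
    # become NaN, which we then treat as zero.
    expanded = df.set_index('Date').asfreq('1min').fillna(0)

    # On a whole-minute grid a 5-row window covers minutes t-4 .. t,
    # i.e. the half-open interval (t-5min, t] the question describes.
    # Legacy API: sums = pd.rolling_sum(expanded['A'], window=5, min_periods=1)
    sums = expanded['A'].rolling(window=5, min_periods=1).sum()

    # Keep only the original, irregular timestamps.
    df['last_5min_sum'] = sums.loc[df['Date']].values
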
Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.

Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.
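The original code is likewise missing here; the following sketch makes the same assumptions (the sample frame built above, dates sorted ascending), and the helper name last_5min_sum is mine:

    import pandas as pd

    def last_5min_sum(row):
        # Binary-search the sorted Date column for the bounds of the
        # half-open window (Date - 5min, Date].
        start = df['Date'].searchsorted(
            row['Date'] - pd.Timedelta(minutes=5), side='right')
        end = df['Date'].searchsorted(row['Date'], side='right')
        return df['A'].iloc[int(start):int(end)].sum()

    df['last_5min_sum'] = df.apply(last_5min_sum, axis=1)
    print(df)

which yields (exact formatting varies slightly across pandas versions):

                     Date  A  last_5min_sum
    0 2014-11-21 11:00:00  1              1
    1 2014-11-21 11:03:00  4              5
    2 2014-11-21 11:04:00  1              6
    3 2014-11-21 11:05:00  2              7
    4 2014-11-21 11:07:00  4             11
    5 2014-11-21 11:08:00  1              8
    6 2014-11-21 11:12:00  1              2
    7 2014-11-21 11:13:00  2              3
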
Here is a benchmark comparing the df.asfreq trick with calling df.apply:
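The benchmark code did not survive in this copy, so below is a sketch of how such a comparison could look; the data size N, the random setup, and the function names are arbitrary illustrative choices, not from the original:

    import timeit

    import numpy as np
    import pandas as pd

    # A larger irregular series for timing: N unique, sorted, whole-minute
    # timestamps drawn from a range ten times as long.
    N = 2000
    rng = np.random.RandomState(0)
    offsets = np.sort(rng.choice(np.arange(10 * N), size=N, replace=False))
    df = pd.DataFrame({
        'Date': pd.Timestamp('2014-11-21') + pd.to_timedelta(offsets, unit='min'),
        'A': rng.randint(0, 10, size=N)})

    def using_asfreq():
        expanded = df.set_index('Date').asfreq('1min').fillna(0)
        sums = expanded['A'].rolling(window=5, min_periods=1).sum()
        return sums.loc[df['Date']].values

    def using_apply():
        def last_5min_sum(row):
            start = df['Date'].searchsorted(
                row['Date'] - pd.Timedelta(minutes=5), side='right')
            end = df['Date'].searchsorted(row['Date'], side='right')
            return df['A'].iloc[int(start):int(end)].sum()
        return df.apply(last_5min_sum, axis=1).values

    # Sanity check: both approaches must agree before we time them.
    assert (using_asfreq() == using_apply()).all()

    print('asfreq:', timeit.timeit(using_asfreq, number=10))
    print('apply: ', timeit.timeit(using_apply, number=10))

As an aside, pandas 0.19 and later support time-based windows directly, so on current versions df.rolling('5min', on='Date')['A'].sum() handles the irregular spacing without either workaround.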