aggregating hourly time series by Day via pd.TimeG

2019-08-03 03:48发布

问题:

df:

                    hour    rev
datetime        
2016-05-01 01:00:00 1   -0.02
2016-05-01 02:00:00 2   -0.01
2016-05-01 03:00:00 3   -0.02
2016-05-01 04:00:00 4   -0.02
2016-05-01 05:00:00 5   -0.01
2016-05-01 06:00:00 6   -0.03
2016-05-01 07:00:00 7   -0.10
2016-05-01 08:00:00 8   -0.09
2016-05-01 09:00:00 9   -0.08
2016-05-01 10:00:00 10  -0.10
2016-05-01 11:00:00 11  -0.12
2016-05-01 12:00:00 12  -0.14
2016-05-01 13:00:00 13  -0.17
2016-05-01 14:00:00 14  -0.16
2016-05-01 15:00:00 15  -0.15
2016-05-01 16:00:00 16  -0.15
2016-05-01 17:00:00 17  -0.17
2016-05-01 18:00:00 18  -0.16
2016-05-01 19:00:00 19  -0.18
2016-05-01 20:00:00 20  -0.17
2016-05-01 21:00:00 21  -0.14
2016-05-01 22:00:00 22  -0.16
2016-05-01 23:00:00 23  -0.08
2016-05-02 00:00:00 24  -0.06

df.reset_index().to_dict('rec'):

[{'datetime': Timestamp('2016-05-01 01:00:00'), 'hour': 1L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 02:00:00'), 'hour': 2L, 'rev': -0.01},
 {'datetime': Timestamp('2016-05-01 03:00:00'), 'hour': 3L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 04:00:00'), 'hour': 4L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 05:00:00'), 'hour': 5L, 'rev': -0.01},
 {'datetime': Timestamp('2016-05-01 06:00:00'), 'hour': 6L, 'rev': -0.03},
 {'datetime': Timestamp('2016-05-01 07:00:00'), 'hour': 7L, 'rev': -0.1},
 {'datetime': Timestamp('2016-05-01 08:00:00'), 'hour': 8L, 'rev': -0.09},
 {'datetime': Timestamp('2016-05-01 09:00:00'), 'hour': 9L, 'rev': -0.08},
 {'datetime': Timestamp('2016-05-01 10:00:00'), 'hour': 10L, 'rev': -0.1},
 {'datetime': Timestamp('2016-05-01 11:00:00'), 'hour': 11L, 'rev': -0.12},
 {'datetime': Timestamp('2016-05-01 12:00:00'), 'hour': 12L, 'rev': -0.14},
 {'datetime': Timestamp('2016-05-01 13:00:00'), 'hour': 13L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 14:00:00'), 'hour': 14L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 15:00:00'), 'hour': 15L, 'rev': -0.15},
 {'datetime': Timestamp('2016-05-01 16:00:00'), 'hour': 16L, 'rev': -0.15},
 {'datetime': Timestamp('2016-05-01 17:00:00'), 'hour': 17L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 18:00:00'), 'hour': 18L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 19:00:00'), 'hour': 19L, 'rev': -0.18},
 {'datetime': Timestamp('2016-05-01 20:00:00'), 'hour': 20L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 21:00:00'), 'hour': 21L, 'rev': -0.14},
 {'datetime': Timestamp('2016-05-01 22:00:00'), 'hour': 22L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 23:00:00'), 'hour': 23L, 'rev': -0.08},
 {'datetime': Timestamp('2016-05-02 00:00:00'), 'hour': 24L, 'rev': -0.06}]

df.set_index('datetime', inplace=True)

I want to aggregate the data by DAY. So I do:

dfgrped = df.groupby([pd.TimeGrouper('D')])

I want to compute stats like the sum:

dfgrped.agg(sum)

            hour    rev
datetime        
2016-05-01  276 -2.43
2016-05-02  24  -0.06

As you can see the aggregation occurs for 2016-05-01 and 2016-05-02.

Notice, that the last hourly data entry in df occurs at 2016-05-02 00:00:00, which is meant to be the data for the last hour of the previous day i.e. 24 hourly data points for each day.

However, given the datetime stamp, things don't work out the way I intended. I want all 24 hours to be aggregated for 2016-05-01.

I imagine this sort of issue must arise often in various applications when a measurement is taken at the end of the hour. This isn't a problem until the last hour, which occurs at the 00:00:00 timestamp of the following day.

How to address this issue in pandas?

回答1:

it looks like another hack, but it should do the job:

In [79]: df.assign(t=df.datetime - pd.Timedelta(hours=1)).drop('datetime',1).groupby(pd.TimeGrouper('D', key='t')).sum()
Out[79]:
            hour   rev
t
2016-05-01   300 -2.49


回答2:

A little bit hack solution, if your starting point for each day is larger than one second, you can subtract one second from the date time column and then groupby date, which seems to work for your case:

from datetime import timedelta
import pandas as pd
df.groupby((df.datetime - timedelta(seconds = 1)).dt.date).sum()

#             hour    rev
#   datetime        
# 2016-05-01   300  -2.49


回答3:

Simply .shift(-1) or .roll(-1), the rev column, backward one. So timestamp would be period start vs period end. You would need to add one timestamp.