I am working with non-uniformly sampled, timestamp-indexed data and will eventually be computing statistics on a per-minute and per-hour basis. I'm wondering what the best way to aggregate by time period is.
I currently define two lambda functions and use them to add two columns to the DataFrame, like so:
h = lambda i: pd.to_datetime(i.strftime('%Y-%m-%d %H:00:00'))
m = lambda i: pd.to_datetime(i.strftime('%Y-%m-%d %H:%M:00'))
df['hours'] = df.index.map(h)
df['minutes'] = df.index.map(m)
This allows me to aggregate easily with groupby, like so:
by_hour = df.groupby('hours')
I'm sure there is a better or more pythonic way to do this, but I haven't figured it out and would appreciate any help.
You have a couple of options with pandas. For simple statistics, you can use the resample method on a DataFrame or Series with a datetime index.
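As a rough sketch of the resample approach (the sample timestamps and the `value` column are made up for illustration; they are not from the question's data):

```python
import pandas as pd

# Hypothetical, irregularly spaced sample data standing in for the real frame.
index = pd.to_datetime([
    "2013-01-01 09:00:05",
    "2013-01-01 09:00:40",
    "2013-01-01 09:02:10",
    "2013-01-01 09:45:00",
    "2013-01-01 10:15:30",
])
df = pd.DataFrame({"value": [1.0, 3.0, 5.0, 7.0, 9.0]}, index=index)

# Downsample into fixed-width bins, each labelled by its left edge.
# Minutes that contain no observations show up as NaN rows.
per_minute = df.resample("1min").mean()

# Hourly bins, written as "60min" here.
per_hour = df.resample("60min").mean()

print(per_hour)
```

This replaces the two helper columns: the binning happens directly on the index, and other aggregations such as `.sum()` or `.count()` can be used in place of `.mean()`.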
For more flexibility you can group by the hour (or minute, second, etc.) attribute of the timestamp objects.
Take a look at the docs on resampling: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#up-and-downsampling
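Grouping on an attribute of the index, as described above, might look something like this. The data and the `value` column are again invented for illustration. Note that grouping on `hour` alone pools the same hour of day across different dates, which is not the same thing as the per-hour columns in the question; adding the date as a second key keeps calendar hours separate.

```python
import pandas as pd

# Hypothetical sample data spanning two days.
index = pd.to_datetime([
    "2013-01-01 09:00:05",
    "2013-01-01 09:45:00",
    "2013-01-01 10:15:30",
    "2013-01-02 09:30:00",
])
df = pd.DataFrame({"value": [1.0, 3.0, 5.0, 8.0]}, index=index)

# Group by hour of day: all 09:xx rows are pooled, regardless of date.
by_hour_of_day = df.groupby(df.index.hour)["value"].mean()

# Group by (date, hour): each calendar hour stays in its own group.
by_date_and_hour = df.groupby([df.index.date, df.index.hour])["value"].mean()

print(by_hour_of_day)
print(by_date_and_hour)
```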