I have some hierarchical data that bottoms out into time series data, which looks something like this:
df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
As is, df.resample('2D', how='sum')
gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
At which point I'm really running out of ideas... Is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
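On its own, that grouping collapses the State and City levels, summing across all cities in each 2-day bin. For the question's frame it produces:

df.groupby(pd.Grouper(freq='2D', level=-1)).sum()

            value_a  value_b
Date
2012-01-01       52      132
2012-01-03       68      148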
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0, 1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
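On recent pandas versions, groupby also accepts index level names directly alongside a Grouper, so the same thing can, I believe, be written more naturally (a sketch, assuming a reasonably current pandas):

result = df.groupby(['State', 'City',
                     pd.Grouper(freq='2D', level='Date')]).sum()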
Either way it looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    # group by the State and City values plus 2-day bins on the Date level
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    # move State and City into columns, then groupby + resample
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State', 'City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0, 1])
              .resample('2D').sum()
              .stack(level=[2, 1])
              .swaplevel(2, 0))

def make_orig():
    # the small example frame from the question
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia'] * 8 + ['Alabama'] * 8
    cities = ['Atlanta'] * 4 + ['Savanna'] * 4 + ['Mobile'] * 4 + ['Montgomery'] * 4
    dates = pd.DatetimeIndex([DT.date(2012, 1, 1) + DT.timedelta(days=i)
                              for i in range(4)] * 4)
    df = pd.DataFrame({'value_a': values_a, 'value_b': values_b},
                      index=[states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    # a larger random frame: 50 states x 10 cities x N dates
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index), 2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))
yields
                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Here is a benchmark comparing using_Grouper, using_reset_index, and using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
An alternative using stack/unstack:
df.unstack(level=[0, 1]).resample('2D').sum().stack(level=[2, 1]).swaplevel(2, 0)
                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
Alabama Mobile     2012-01-01       17       37
        Montgomery 2012-01-01       25       45
Georgia Savanna    2012-01-01        9       29
        Atlanta    2012-01-03        5       25
Alabama Mobile     2012-01-03       21       41
        Montgomery 2012-01-03       29       49
Georgia Savanna    2012-01-03       13       33
Notes:
- No idea about performance comparison
- Possible pandas bug: stack(level=[2,1]) worked, but stack(level=[1,2]) failed
This works (note that it assumes Date is a column, not an index level; if Date is in the index, pull it out first with df.reset_index('Date')):

df.groupby(level=[0, 1]).apply(lambda x: x.set_index('Date').resample('2D').sum())
                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
If the Date column contains strings, convert it to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
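Putting it together for the question's frame, a sketch assuming Date starts out as the third index level rather than a column:

result = (df.reset_index('Date')
            .groupby(level=['State', 'City'])
            .apply(lambda x: x.set_index('Date').resample('2D').sum()))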
I know this question is a few years old, but I had the same problem and came up with a simpler solution that requires one line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-07-01     1
                                  2012-07-02    13
                                  2012-07-03     1
                                  2012-07-04     1
                                  2012-07-05    10
                                  2012-07-06     4
                                  2012-07-07    47
                                  2012-07-08     0
                                  2012-07-09     3
                                  2012-07-10    22
                                  2012-07-11     3
                                  2012-07-12     0
                                  2012-07-13    22
                                  2012-07-14     1
                                  2012-07-15     2
                                  2012-07-16     2
                                  2012-07-17     8
                                  2012-07-18     0
                                  2012-07-19     1
                                  2012-07-20    10
                                  2012-07-21     0
                                  2012-07-22     3
                                  2012-07-23     0
                                  2012-07-24    35
                                  2012-07-25     6
                                  2012-07-26     1
                                  2012-07-27     0
                                  2012-07-28     6
                                  2012-07-29    23
                                  2012-07-30     0
                                                ..
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2014-06-02     0
                                  2014-06-03     1
                                  2014-06-04     0
                                  2014-06-05     0
                                  2014-06-06     0
                                  2014-06-07     0
                                  2014-06-08     2
                                  2014-06-09     0
                                  2014-06-10     0
                                  2014-06-11     0
                                  2014-06-12     0
                                  2014-06-13     0
                                  2014-06-14     0
                                  2014-06-15     0
                                  2014-06-16     0
                                  2014-06-17     0
                                  2014-06-18     0
                                  2014-06-19     0
                                  2014-06-20     0
                                  2014-06-21     0
                                  2014-06-22     0
                                  2014-06-23     0
                                  2014-06-24     0
                                  2014-06-25     4
                                  2014-06-26     0
                                  2014-06-27     1
                                  2014-06-28     0
                                  2014-06-29     0
                                  2014-06-30     1
                                  2014-07-01     0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-06-25/2012-07-01      1
                                  2012-07-02/2012-07-08     76
                                  2012-07-09/2012-07-15     53
                                  2012-07-16/2012-07-22     24
                                  2012-07-23/2012-07-29     71
                                  2012-07-30/2012-08-05     38
                                  2012-08-06/2012-08-12    258
                                  2012-08-13/2012-08-19    144
                                  2012-08-20/2012-08-26    184
                                  2012-08-27/2012-09-02    323
                                  2012-09-03/2012-09-09    198
                                  2012-09-10/2012-09-16    348
                                  2012-09-17/2012-09-23    404
                                  2012-09-24/2012-09-30    380
                                  2012-10-01/2012-10-07    367
                                  2012-10-08/2012-10-14    163
                                  2012-10-15/2012-10-21    338
                                  2012-10-22/2012-10-28    252
                                  2012-10-29/2012-11-04    197
                                  2012-11-05/2012-11-11    336
                                  2012-11-12/2012-11-18    234
                                  2012-11-19/2012-11-25    143
                                  2012-11-26/2012-12-02    204
                                  2012-12-03/2012-12-09    296
                                  2012-12-10/2012-12-16    146
                                  2012-12-17/2012-12-23     85
                                  2012-12-24/2012-12-30    198
                                  2012-12-31/2013-01-06    214
                                  2013-01-07/2013-01-13    229
                                  2013-01-14/2013-01-20    192
                                  ...
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2013-12-09/2013-12-15      3
                                  2013-12-16/2013-12-22      0
                                  2013-12-23/2013-12-29      0
                                  2013-12-30/2014-01-05      1
                                  2014-01-06/2014-01-12      3
                                  2014-01-13/2014-01-19      6
                                  2014-01-20/2014-01-26     11
                                  2014-01-27/2014-02-02      0
                                  2014-02-03/2014-02-09      1
                                  2014-02-10/2014-02-16      4
                                  2014-02-17/2014-02-23      3
                                  2014-02-24/2014-03-02      1
                                  2014-03-03/2014-03-09      4
                                  2014-03-10/2014-03-16      0
                                  2014-03-17/2014-03-23      0
                                  2014-03-24/2014-03-30      9
                                  2014-03-31/2014-04-06      1
                                  2014-04-07/2014-04-13      1
                                  2014-04-14/2014-04-20      1
                                  2014-04-21/2014-04-27      2
                                  2014-04-28/2014-05-04      8
                                  2014-05-05/2014-05-11      7
                                  2014-05-12/2014-05-18      5
                                  2014-05-19/2014-05-25      2
                                  2014-05-26/2014-06-01      8
                                  2014-06-02/2014-06-08      3
                                  2014-06-09/2014-06-15      0
                                  2014-06-16/2014-06-22      0
                                  2014-06-23/2014-06-29      5
                                  2014-06-30/2014-07-06      1
dtype: int64
ts.unstack().T.resample('W', how='sum').T.stack()

is all it took! Very easy, and it seems quite performant. The pickle I'm reading in is 331 MB, so this is a pretty beefy data structure; the resampling takes just a couple of seconds on my MacBook Pro.
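Note that the how argument was deprecated in pandas 0.18 and removed in later versions; on current pandas the equivalent chain should be (untested against this dataset):

ts.unstack().T.resample('W').sum().T.stack()

The unstack moves the dates into columns, the first .T puts them on the index so resample can act on them, and the final .T.stack() restores the original shape.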
I had the same issue and was breaking my head over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I see there's a new kwarg called level that you can use to specify a level in a MultiIndex.

Edit: More details in the "What's New" section.
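A sketch of how that kwarg reads; note that resampling on the level alone aggregates across the other levels, so for per-city results it still needs to be combined with a groupby, as in the pd.Grouper answer above:

df.resample('2D', level='Date').sum()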
I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex was a kind of manual "split-apply-combine" process using a dictionary comprehension.

Assuming your DataFrame is unindexed (you can call .reset_index() first), this works as follows:
- Group by the non-date columns
- Set "Date" as the index of each chunk and resample it
- Reassemble using pd.concat
The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in df.groupby(["State", "City"])})
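If you want the State/City levels of the result to carry their names, pd.concat also accepts a names argument for the key levels it adds (a small, untested addition):

pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in df.groupby(["State", "City"])},
          names=["State", "City"])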