I have data in the following general format that I would like to resample to 30 day time series windows:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I would like the 30 day windows to start on 2014-01-01 and end on 12-31-2018. It is NOT guaranteed that every customer will have records in every window. If a customer has multiple transactions in a window, then it takes the weighted average of the price, sums the units, and concat the product names to create one record per customer per window.
What I have so far is something like this:
wa = lambda x:np.average(x, weights=df.loc[x.index, 'units'])
con = lambda x: '/'.join(x))
agg_funcs = {'customer_id':'first',
'product':'con',
'price':'wa',
'transaction_dt':'first',
'units':'sum'}
df_window = df.groupby(['customer_id', pd.Grouper(freq='30D')]).agg(agg_funcs)
df_window_final = df_window.unstack('customer_id', fill_value=0)
If anyone knows some better ways to approach this problem (particularly with an in-place and/or vectorized method), I would appreciate it. Ideally, I would also like to add the window start and stop dates as columns to the rows as well.
The final output would look like this ideally:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
1,2004-01-02,thing1/thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
2,2004-01-29,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2017-07-15,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2016-05-12,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2012-02-23,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2009-10-10,thing1,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2014-04-04,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
5,2008-07-09,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
Edited for new solution. I think you can convert each of the
transaction_dt
to a Period object of 30 days and then do the grouping.We can now use this dataframe to get the concatenation of products, weighted average of price and sum of units. We then use some of the Period functionality to get the end time.