Here is my dataframe:
import pandas as pd
df = pd.DataFrame({
'KEY': [1, 2, 3, 1, 1, 2],
'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
'AMOUNT': [5, 3, 11, 14, 7, 9],
})
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>> AMOUNT KEY START_DATE STOP_DATE
0 5 A 2018-01-05 2018-01-22
1 3 B 2018-01-04 2018-03-10
2 11 C 2018-01-01 2018-01-31
3 14 A 2018-01-23 2018-02-15
4 7 A 2018-02-01 2018-04-01
5 9 B 2018-03-11 2018-07-21
I am trying to get the AMOUNT
per month and per KEY
considering the AMOUNT
as linearly distributed (by day) between START_DATE
and STOP_DATE
. The output is shown below. I would like to also keep track of the number of charged days in a month. For example KEY = A
has overlapped periods in February so the number of charged periods can be > 28.
DAYS AMOUNT
A 2018_01 27 10.250000
2018_02 43 12.016667
2018_03 31 3.616667
2018_04 1 0.116667
B 2018_01 28 1.272727
2018_02 28 1.272727
2018_03 31 1.875598
2018_04 30 2.030075
2018_05 31 2.097744
2018_06 30 2.030075
2018_07 21 1.421053
C 2018_01 31 11.000000
2018_02 0 0.000000
I came up with the solution detailed below but it is highly inefficient and takes an unaffordable amount of time to run for a dataset with ~100 million rows. I am looking for an improved version but could not manage to vectorize the pd.date_range
part of it. Not sure if numba
@jit could help here? Added a tag just in case.
from pandas.tseries.offsets import MonthEnd
# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': min, 'STOP_DATE':max}).reset_index()
multiindex = []
for row in bounds.itertuples():
dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(),
freq='M').strftime('%Y_%m')
multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])
# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
data = pd.Series(index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE))
data = data.resample('MS').size().rename('DAYS').to_frame()
data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
data.index = data.index.strftime('%Y_%m')
# Add data to the final dataframe
final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values