For each row in the input table, I need to generate multiple rows by splitting the date range at month boundaries (please refer to the sample output below).
There is a simple iterative approach that converts row by row, but it is very slow on large dataframes.
Could anyone suggest a vectorized approach, for example using apply() or map(), to achieve this?
The output table is a new table.
Input:
ID, START_DATE, END_DATE
1, 2010-12-08, 2011-03-01
2, 2010-12-10, 2011-01-12
3, 2010-12-16, 2011-03-07
Output:
ID, START_DATE, END_DATE, NUMBER_DAYS, ACTION_DATE
1, 2010-12-08, 2010-12-31, 23, 201012
1, 2010-12-08, 2011-01-31, 54, 201101
1, 2010-12-08, 2011-02-28, 82, 201102
1, 2010-12-08, 2011-03-01, 83, 201103
2, 2010-12-10, 2010-12-31, 21, 201012
2, 2010-12-10, 2011-01-12, 33, 201101
3, 2010-12-16, 2010-12-31, 15, 201012
4, 2010-12-16, 2011-01-31, 46, 201101
5, 2010-12-16, 2011-02-28, 74, 201102
6, 2010-12-16, 2011-03-07, 81, 201103
I think you can use:
import pandas as pd
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3},
                   'END_DATE': {0: pd.Timestamp('2011-03-01 00:00:00'),
                                1: pd.Timestamp('2011-01-12 00:00:00'),
                                2: pd.Timestamp('2011-03-07 00:00:00')},
                   'START_DATE': {0: pd.Timestamp('2010-12-08 00:00:00'),
                                  1: pd.Timestamp('2010-12-10 00:00:00'),
                                  2: pd.Timestamp('2010-12-16 00:00:00')}},
                  columns=['ID','START_DATE', 'END_DATE'])
print(df)
ID START_DATE END_DATE
0 1 2010-12-08 2011-03-01
1 2 2010-12-10 2011-01-12
2 3 2010-12-16 2011-03-07
#if multiple columns, you can filter them by subset
#df = df[['ID','START_DATE', 'END_DATE']]
#stack columns START_DATE and END_DATE
#the chain is wrapped in parentheses so that it can span several lines
df1 = (df.set_index('ID')
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame()
         .rename(columns={0:'Date'}))
#print df1
#resample and fill missing data
#'M' is the month-end frequency; newer pandas releases spell it 'ME'
df1 = (df1.groupby(df1.index)
          .apply(lambda x: x.set_index('Date').resample('M').asfreq())
          .reset_index())
print(df1)
ID Date
0 1 2010-12-31
1 1 2011-01-31
2 1 2011-02-28
3 1 2011-03-31
4 2 2010-12-31
5 2 2011-01-31
6 3 2010-12-31
7 3 2011-01-31
8 3 2011-02-28
9 3 2011-03-31
There is a problem with the last day of the month: resample returns month-end dates, so the last row of each ID no longer matches the real END_DATE. To fix this, first create period columns in both frames and merge on them. Then combine_first fills the missing END_DATE values from column Date, and bfill fills the missing values of column START_DATE.
df['period'] = df.END_DATE.dt.to_period('M')
df1['period'] = df1.Date.dt.to_period('M')
df2 = pd.merge(df1, df, on=['ID','period'], how='left')
df2['END_DATE'] = df2.END_DATE.combine_first(df2.Date)
df2['START_DATE'] = df2.START_DATE.bfill()
df2 = df2.drop(['Date','period'], axis=1)
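Why the period columns make the merge line up can be seen on a single pair of dates. The sketch below is only an illustration: the variable names are made up for the example, and the three short Series are hypothetical stand-ins for the Date, END_DATE and START_DATE columns of ID 2 right after the left merge.

```python
import pandas as pd

# resample gives the month end, but the real range stops earlier in the month
month_end = pd.Timestamp("2011-01-31")
real_end = pd.Timestamp("2011-01-12")

# the timestamps differ, yet both fall into the same monthly period,
# which is why merging on ['ID', 'period'] finds the matching row
same_day = month_end == real_end
same_period = month_end.to_period("M") == real_end.to_period("M")

# hypothetical stand-ins for ID 2 after the left merge: only the last
# row found a partner in df, the earlier row holds NaT
date = pd.Series(pd.to_datetime(["2010-12-31", "2011-01-31"]))
end_date = pd.Series([pd.NaT, pd.Timestamp("2011-01-12")])
start_date = pd.Series([pd.NaT, pd.Timestamp("2010-12-10")])

# combine_first keeps existing values and fills the gaps from `date`
filled_end = end_date.combine_first(date)
# bfill copies the next valid value backwards into the gaps
filled_start = start_date.bfill()

print(same_day, same_period)
print(filled_end.tolist())
print(filled_start.tolist())
```

Note that bfill relies on the rows of each ID being sorted by date, so that the only matched row is the last one of its group.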
Finally, add the new columns: the difference in days via dt.days and the year-month label via dt.strftime:
df2['NUMBER_DAYS'] = (df2.END_DATE - df2.START_DATE).dt.days
df2['ACTION_DATE'] = df2.END_DATE.dt.strftime('%Y%m')
print(df2)
ID START_DATE END_DATE NUMBER_DAYS ACTION_DATE
0 1 2010-12-08 2010-12-31 23 201012
1 1 2010-12-08 2011-01-31 54 201101
2 1 2010-12-08 2011-02-28 82 201102
3 1 2010-12-08 2011-03-01 83 201103
4 2 2010-12-10 2010-12-31 21 201012
5 2 2010-12-10 2011-01-12 33 201101
6 3 2010-12-16 2010-12-31 15 201012
7 3 2010-12-16 2011-01-31 46 201101
8 3 2010-12-16 2011-02-28 74 201102
9 3 2010-12-16 2011-03-07 81 201103
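If the groupby/apply step turns out to be the bottleneck, the same table can probably be built without any per-group Python call. The following is a rough sketch of a different technique, not the method above: it repeats each input row once per calendar month with Index.repeat and derives the month ends from integer month arithmetic. It assumes that both date columns are already datetime64 and that START_DATE <= END_DATE in every row; the names n, k, total and out are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
})

# Assumption: START_DATE <= END_DATE in every row.
start, end = df["START_DATE"], df["END_DATE"]

# Number of calendar months each range touches (first and last month included).
n = ((end.dt.year - start.dt.year) * 12
     + (end.dt.month - start.dt.month) + 1).to_numpy()

# Repeat every input row once per month, then number the copies 0, 1, 2, ...
out = df.loc[df.index.repeat(n)].reset_index(drop=True)
k = np.arange(n.sum()) - np.repeat(n.cumsum() - n, n)

# Absolute month number of the start month, shifted forward by k months.
total = out["START_DATE"].dt.year * 12 + out["START_DATE"].dt.month - 1 + k
first_of_month = pd.to_datetime(
    pd.DataFrame({"year": total // 12, "month": total % 12 + 1, "day": 1})
)
# MonthEnd(0) rolls a date forward to the end of its own month.
month_end = first_of_month + pd.offsets.MonthEnd(0)

# The last piece of each range stops at the real END_DATE, not at the month end.
out["END_DATE"] = month_end.where(month_end < out["END_DATE"], out["END_DATE"])
out["NUMBER_DAYS"] = (out["END_DATE"] - out["START_DATE"]).dt.days
out["ACTION_DATE"] = out["END_DATE"].dt.strftime("%Y%m")
print(out)
```

Whether this is actually faster on a given dataset is worth measuring; the point of the sketch is only that every step operates on whole columns.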
You can also try this, using the pandas date_range function together with DataFrame.apply.
In your output, the rows after ID 3 are labelled 4, 5 and 6. I believe they should all be 3. Please check.
import pandas as pd
from datetime import datetime

l_ret_df = pd.DataFrame(columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE'))

def generate_ts_df(p_row):
    l_id = p_row['ID']
    l_start = p_row['START_DATE']
    l_start_date = datetime.strptime(l_start,'%Y-%m-%d')
    l_end = p_row['END_DATE']
    l_end_date = datetime.strptime(l_end,'%Y-%m-%d')
    # month-end dates inside the range ('M' is spelled 'ME' in newer pandas releases)
    l_df = pd.date_range(start=l_start,end=l_end,freq='M')
    global l_ret_df
    # one row per month end; pd.concat is used because DataFrame.append was removed in pandas 2.0
    for e in l_df:
        l_ret_df = pd.concat([l_ret_df, pd.DataFrame([[l_id,l_start,e.date(),(e.date()-l_start_date.date()).days,e.strftime('%Y%m')]],columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE'))])
    # final row for the partial last month, ending at END_DATE itself
    l_ret_df = pd.concat([l_ret_df, pd.DataFrame([[l_id,l_start,l_end,(l_end_date.date()-l_start_date.date()).days,l_end_date.strftime('%Y%m')]],columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE'))])
    return 1

if __name__ == "__main__":
    l_ts_base = pd.DataFrame([[1, '2010-12-08', '2011-03-01'],
                              [2, '2010-12-10', '2011-01-12'],
                              [3, '2010-12-16', '2011-03-07']], columns=('ID', 'START_DATE', 'END_DATE'))
    l_ts_base.apply(generate_ts_df, axis=1)
    print(l_ret_df)
Output
ID START_DATE END_DATE NUMBER_DAYS ACTION_DATE
0 1 2010-12-08 2010-12-31 23 201012
0 1 2010-12-08 2011-01-31 54 201101
0 1 2010-12-08 2011-02-28 82 201102
0 1 2010-12-08 2011-03-01 83 201103
0 2 2010-12-10 2010-12-31 21 201012
0 2 2010-12-10 2011-01-12 33 201101
0 3 2010-12-16 2010-12-31 15 201012
0 3 2010-12-16 2011-01-31 46 201101
0 3 2010-12-16 2011-02-28 74 201102
0 3 2010-12-16 2011-03-07 81 201103
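Growing a DataFrame inside the loop copies the accumulated frame on every step, which tends to get slow as the result grows. A possible variation on the same date_range idea is to collect plain tuples and build the frame once at the end. This is only a sketch: split_by_month, base, cut_points and result are hypothetical names, and pd.offsets.MonthEnd() is used instead of the 'M' alias. The filter d < end is there so that a range ending exactly on a month end does not yield that date twice.

```python
import pandas as pd

base = pd.DataFrame(
    [[1, "2010-12-08", "2011-03-01"],
     [2, "2010-12-10", "2011-01-12"],
     [3, "2010-12-16", "2011-03-07"]],
    columns=["ID", "START_DATE", "END_DATE"],
)

def split_by_month(row):
    """Yield one output tuple per calendar month touched by the row's range."""
    start = pd.Timestamp(row.START_DATE)
    end = pd.Timestamp(row.END_DATE)
    # month ends inside the range; the real end date closes the last piece
    month_ends = pd.date_range(start, end, freq=pd.offsets.MonthEnd())
    cut_points = [d for d in month_ends if d < end] + [end]
    for cut in cut_points:
        yield (row.ID, start.date(), cut.date(),
               (cut - start).days, cut.strftime("%Y%m"))

# collect all pieces first, then construct the DataFrame a single time
rows = [piece
        for row in base.itertuples(index=False)
        for piece in split_by_month(row)]
result = pd.DataFrame(
    rows,
    columns=["ID", "START_DATE", "END_DATE", "NUMBER_DAYS", "ACTION_DATE"],
)
print(result)
```

This is still a Python-level loop over the rows, so it is not vectorized; it only avoids the repeated copying and the global variable.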