python pandas: trying to vectorize a function using date_range

Posted on 2019-01-28 12:55

Question:

Here is my dataframe:

import pandas as pd
df = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
    'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
    'AMOUNT': [5, 3, 11, 14, 7, 9],
    })
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>>   AMOUNT KEY  START_DATE  STOP_DATE
0       5     A   2018-01-05  2018-01-22
1       3     B   2018-01-04  2018-03-10
2      11     C   2018-01-01  2018-01-31
3      14     A   2018-01-23  2018-02-15
4       7     A   2018-02-01  2018-04-01
5       9     B   2018-03-11  2018-07-21

I am trying to get the AMOUNT per month and per KEY, treating AMOUNT as linearly distributed by day between START_DATE and STOP_DATE (both inclusive). The desired output is shown below. I would also like to keep track of the number of charged days in each month; for example, KEY = A has overlapping periods in February, so the number of charged days in a month can exceed 28.

            DAYS   AMOUNT
A 2018_01    27  10.250000
  2018_02    43  12.016667
  2018_03    31   3.616667
  2018_04     1   0.116667
B 2018_01    28   1.272727
  2018_02    28   1.272727
  2018_03    31   1.875598
  2018_04    30   2.030075
  2018_05    31   2.097744
  2018_06    30   2.030075
  2018_07    21   1.421053
C 2018_01    31  11.000000
  2018_02     0   0.000000
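To make the proration rule concrete, here is a hand check of the A / 2018_01 cell (all numbers taken from the sample data above): row 0 falls entirely in January, while row 3 contributes only its 9 January days (Jan 23-31) out of a 24-day span.

```python
# Hand check of the A / 2018_01 cell in the expected output.
# Row 0: 2018-01-05 -> 2018-01-22, AMOUNT 5, 18 days, all in January.
# Row 3: 2018-01-23 -> 2018-02-15, AMOUNT 14, 24 days, 9 of them in January.
row0_days, row0_total, row0_amount = 18, 18, 5
row3_days, row3_total, row3_amount = 9, 24, 14

days_jan = row0_days + row3_days
amount_jan = (row0_amount * row0_days / row0_total
              + row3_amount * row3_days / row3_total)
print(days_jan, amount_jan)  # 27 10.25
```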

I came up with the solution detailed below, but it is highly inefficient and takes an unaffordable amount of time to run on a dataset with ~100 million rows. I am looking for an improved version but could not manage to vectorize the pd.date_range part of it. Not sure whether numba @jit could help here? I added the tag just in case.

from pandas.tseries.offsets import MonthEnd
# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': 'min', 'STOP_DATE': 'max'}).reset_index()
multiindex = []
for row in bounds.itertuples():
    dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(),
                          freq='M').strftime('%Y_%m')
    multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])

# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
    data = pd.Series(0, index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE))
    data = data.resample('MS').size().rename('DAYS').to_frame()
    data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
    data.index = data.index.strftime('%Y_%m')
    # Add data to the final dataframe
    final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
    final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values
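For clarity, this is what a single iteration of the loop above builds, run standalone on row 3 of the sample data (a sketch; the values are hard-coded from that row):

```python
import pandas as pd

# Row 3 of the sample data: KEY 'A', 2018-01-23 -> 2018-02-15, AMOUNT 14.
start, stop, amount = pd.Timestamp('2018-01-23'), pd.Timestamp('2018-02-15'), 14
total_days = (stop - start).days + 1  # 24

# One entry per charged day, then count the days falling in each month.
data = pd.Series(0, index=pd.date_range(start=start, end=stop))
data = data.resample('MS').size().rename('DAYS').to_frame()
data['AMOUNT'] = data.DAYS / total_days * amount
data.index = data.index.strftime('%Y_%m')
# data now holds: 2018_01 -> 9 days, 5.25; 2018_02 -> 15 days, 8.75
```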

Answer 1:

I eventually came up with the solution below (heavily inspired by @jezrael's answer on this post). It is probably not the most memory-efficient solution, but that is not a major concern for me; execution time was the problem!

from pandas.tseries.offsets import MonthBegin

df['ID'] = range(len(df))
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
df
>>>   AMOUNT KEY  START_DATE  STOP_DATE   ID  TOTAL_DAYS
0       5     A   2018-01-05  2018-01-22   0      18
1       3     B   2018-01-04  2018-03-10   1      66
2      11     C   2018-01-01  2018-01-31   2      31
3      14     A   2018-01-23  2018-02-15   3      24
4       7     A   2018-02-01  2018-04-01   4      60
5       9     B   2018-03-11  2018-07-21   5     133

final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
           .reset_index(level=-1, drop=True)
           .rename('DATE_AFTER')
           .to_frame())
final = final.groupby('ID').apply(
    lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on=['ID'])
final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()

At this point final looks like this:

final
>>> ID DATE_AFTER KEY  AMOUNT  TOTAL_DAYS  PERIOD DATE_BEFORE
0    0 2018-01-31   A       5          18 2018-01  2018-01-01
1    1 2018-01-31   B       3          66 2018-01  2018-01-01
2    1 2018-02-28   B       3          66 2018-02  2018-02-01
3    1 2018-03-31   B       3          66 2018-03  2018-03-01
4    2 2018-01-31   C      11          31 2018-01  2018-01-01
5    3 2018-01-31   A      14          24 2018-01  2018-01-01
6    3 2018-02-28   A      14          24 2018-02  2018-02-01
7    4 2018-02-28   A       7          60 2018-02  2018-02-01
8    4 2018-03-31   A       7          60 2018-03  2018-03-01
9    4 2018-04-30   A       7          60 2018-04  2018-04-01
10   5 2018-03-31   B       9         133 2018-03  2018-03-01
11   5 2018-04-30   B       9         133 2018-04  2018-04-01
12   5 2018-05-31   B       9         133 2018-05  2018-05-01
13   5 2018-06-30   B       9         133 2018-06  2018-06-01
14   5 2018-07-31   B       9         133 2018-07  2018-07-01
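The key step here is resample('M').asfreq(): given just the two boundary dates of an ID, it emits one row per month-end covering the whole span. In isolation it behaves like this (a minimal sketch using ID 1's dates):

```python
import pandas as pd

# ID 1 spans 2018-01-04 to 2018-03-10; resampling the two boundary dates
# to month-end frequency yields one row per covered month.
frame = pd.DataFrame({'dummy': [0, 0]},
                     index=pd.to_datetime(['2018-01-04', '2018-03-10']))
month_ends = frame.resample('M').asfreq().index
# month_ends: 2018-01-31, 2018-02-28, 2018-03-31
```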

We then merge the initial df back in twice, matching on the period of STOP_DATE and of START_DATE respectively, so that the first and last month of each row get their true boundary dates:

final = pd.merge(
    final,
    df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])

final = pd.merge(
    final,
    df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])

final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)

final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1

final = final.drop(columns=['ID', 'DATE_AFTER', 'DATE_BEFORE'])
final.AMOUNT *= final.DAYS/final.TOTAL_DAYS
final = final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': 'sum', 'DAYS': 'sum'})

With the expected result:

             AMOUNT     DAYS
KEY PERIOD                  
A   2018-01  10.250000    27   
    2018-02  12.016667    43
    2018-03   3.616667    31
    2018-04   0.116667     1
B   2018-01   1.272727    28
    2018-02   1.272727    28
    2018-03   1.875598    31
    2018-04   2.030075    30
    2018-05   2.097744    31
    2018-06   2.030075    30
    2018-07   1.421053    21
C   2018-01  11.000000    31
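For reference, the whole pipeline above can be wrapped into a single function (a sketch assuming the same column names; `prorate_monthly` is my name for it, not from the original post):

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin

def prorate_monthly(df):
    """Prorate AMOUNT linearly by day into (KEY, month) buckets.

    Sketch of the vectorized approach above; assumes columns KEY,
    START_DATE, STOP_DATE (datetime64) and AMOUNT.
    """
    df = df.copy()
    df['ID'] = range(len(df))
    df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1

    # One row per month-end between START_DATE and STOP_DATE for each ID.
    final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
               .reset_index(level=-1, drop=True)
               .rename('DATE_AFTER')
               .to_frame())
    final = final.groupby('ID').apply(
        lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
    final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], on='ID')
    final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
    final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()

    # Clip the first/last month of each ID to the actual start/stop dates.
    final = final.merge(
        df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
        how='left', on=['ID', 'PERIOD'])
    final = final.merge(
        df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
        how='left', on=['ID', 'PERIOD'])
    final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
    final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)

    final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1
    final['AMOUNT'] *= final.DAYS / final.TOTAL_DAYS
    return final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': 'sum', 'DAYS': 'sum'})

sample = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': pd.to_datetime(['2018-01-05', '2018-01-04', '2018-01-01',
                                  '2018-01-23', '2018-02-01', '2018-03-11']),
    'STOP_DATE': pd.to_datetime(['2018-01-22', '2018-03-10', '2018-01-31',
                                 '2018-02-15', '2018-04-01', '2018-07-21']),
    'AMOUNT': [5, 3, 11, 14, 7, 9]})
res = prorate_monthly(sample)
```

Because the proration is linear, the per-month AMOUNTs must sum back to the original total (49 here), which is a handy sanity check on any refactoring.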