I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can happen on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows with EventDates in df2 that are 3-6 months (i.e. 1 quarter) after DATADATEs in rows of df1 will be merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
5 123456 2005-01-31 2005-05-25
5 123456 2005-01-31 2005-05-30
6 123456 2005-03-31 2005-08-08
7 123456 2005-06-30 2005-11-19
8 123456 2005-09-30 NaN
I am trying to apply this related topic [ Merge pandas DataFrames based on irregular time intervals ] by adding start_time and end_time columns to df1 denoting 3 months (start_time) to 6 months (end_time) after DATADATE, then using np.searchsorted(), but this case is a bit trickier because I'd like to merge on a company-by-company basis.
This is my solution going off of the algorithm that Ami Tavory suggested below:
This is actually one of those rare questions where the algorithmic complexity might be significantly different for different solutions. You might want to consider this over the niftiness of 1-liner snippets.
Algorithmically:
sort the larger of the dataframes according to the date
for each date in the smaller dataframe, use the
bisect
module to find the relevant rows in the larger dataframeFor dataframes with lengths m and n, respectively (m < n) the complexity should be O(m log(n)).