I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps
function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.
What I am doing is joining two data tables where the time
in one table falls between the start_time
and end_time
in another table. The periodicity of the two tables is not the same - table_A
occurs on a per second basis and can have multiple entries at each interval, while table_B
will have one entry every 0 - 10 minutes at irregular intervals.
This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others
The following code provides my desired output in R:
# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]
# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)
# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]
> head(table_A)
time x y dummy
1: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
2: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
3: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
4: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
5: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
6: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
> head(table_B)
x y start_time end_time
1: 6183 1 2016-07-11 12:00:45 2016-07-11 12:00:56
2: 6183 1 2016-07-11 12:01:20 2016-07-11 12:01:20
3: 6183 1 2016-07-11 12:01:40 2016-07-11 12:03:26
4: 6183 1 2016-07-11 12:04:20 2016-07-11 12:04:40
5: 6183 1 2016-07-11 12:04:55 2016-07-11 12:04:57
6: 6183 1 2016-07-11 12:05:40 2016-07-11 12:05:51
So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.
One thing that may not be apparent from the example data is that multiple x and y values occur at time
s within the same start_time
and end_time
s.
> head(joined)
y x start_time end_time time
1 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
2 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
3 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
4 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
5 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
6 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:55