R foverlaps equivalent in Python

2019-02-19 19:58发布

问题:

I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.

What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same - table_A occurs on a per second basis and can have multiple entries at each interval, while table_B will have one entry every 0 - 10 minutes at irregular intervals.

This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others

The following code provides my desired output in R:

# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]

# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)

# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]


> head(table_A)
   time                         x       y     dummy
1: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27 
2: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
3: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
4: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
5: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32
6: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32


> head(table_B)
                x       y   start_time              end_time
1:              6183    1   2016-07-11 12:00:45     2016-07-11 12:00:56 
2:              6183    1   2016-07-11 12:01:20     2016-07-11 12:01:20   
3:              6183    1   2016-07-11 12:01:40     2016-07-11 12:03:26  
4:              6183    1   2016-07-11 12:04:20     2016-07-11 12:04:40  
5:              6183    1   2016-07-11 12:04:55     2016-07-11 12:04:57  
6:              6183    1   2016-07-11 12:05:40     2016-07-11 12:05:51  

So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.

One thing that may not be apparent from the example data is that multiple x and y values occur at times within the same start_time and end_times.

> head(joined)
  y      x      start_time              end_time                time 
1 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
2 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
3 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
4 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
5 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
6 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:55 

回答1:

Consider a straightforward merge with subset using pandas.Series.between(). Merge joins all combinations of the join columns and the subset keeps rows that align to time intervals.

df = pd.merge(table_A, table_B, on=['x', 'y'])                   
df = df[df['time'].between(df['start_time'], df['end_time'], inclusive=True)]

However, one important item is your dates should be casted as datetime type. Currently, your post shows string dates which affects above .between(). Below assumes US dates with month first as MM/DD/YYYY. Either you can convert types during file read in:

dateparse = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M:%S')

table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)

table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)

Or after read in:

table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')

table_B['start_time'], table_B['end_time']=(pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S') \
                                    for ser in [table_B['start_time'], table_B['end_time']])