Merge Records Over Time Interval

Posted 2019-01-18 15:49

Question:

Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.

The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one-to-many relationship but, here's the twist: once the records are matched by keys, they also have to fall within a specific time window given by a start time and a duration.

For example, a few records in df A:

    OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal

And from df B:

    OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00  

The desired outcome from the merge would be:

    OBS ID Time     Outcome  
    1   01 10:12:10 Normal 
    3   02 10:12:45 Weird 

Desired result: dataframe B with outcomes merged in from A. Notice that observations 2 and 4 were dropped: although their IDs matched records in A, their times did not fall within any of the given intervals.

Question

Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?

Answer 1:

Set up data

First set up the input data frames. We create two versions of each: A and B keep the times as character columns, while At and Bt use the "times" class from the chron package (which has the advantage over the "character" class that values can be added and subtracted):

LinesA <- "OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal"

LinesB <- "OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE, 
               colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE, 
               colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron) 

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
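
As a quick illustration of why the "times" class helps, here is a minimal sketch (assuming the chron package is installed; the variable names are made up for this example) that computes the end of the first interval in A and checks one time from B against it:

```r
library(chron)

# First interval in A: starts at 10:12:06 and lasts 10 seconds.
start    <- times("10:12:06")
duration <- times("00:00:10")

# "times" objects can be added, which plain character strings cannot.
end <- start + duration
print(end)

# OBS 1 in B (10:12:10) should fall inside the closed interval [start, end].
t1 <- times("10:12:10")
inside <- t1 >= start & t1 <= end
print(inside)   # TRUE
```

The same comparison is what the SQL "between" clause in the next section performs row by row.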

sqldf with times class

Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
   where Time between StartTime and StartTime + Duration",
   method = "raw")

out$Time <- times(as.numeric(out$Time))

The result is:

> out
  OBS ID     Time Outcome
1   1 01 10:12:10  Normal
2   3 02 10:12:45   Weird

With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver 
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
    where Time between StartTime and StartTime + Duration")

sqldf with character class

It's actually possible to avoid the "times" class altogether by performing all time calculations in sqlite directly on the character strings, using sqlite's strftime function. The SQL statement is unfortunately a bit more involved:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
    where strftime('%s', Time) - strftime('%s', StartTime)
       between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")

EDIT:

A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.

EDIT:

Simplified/improved final sqldf statement.



Answer 2:

Here is an example:

# first, merge by ID
z <- merge(A[, -1], B, by = "ID")

# convert string to POSIX time
z <- transform(z,
  s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
  dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) - 
    as.numeric(strptime("00:00:00", "%H:%M:%S")),
  tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))

# subset by time range
subset(z, s_t < tim & tim < s_t + dur)

The output:

  ID StartTime Duration Outcome OBS     Time        s_t dur        tim
1  1  10:12:06 00:00:10  Normal   1 10:12:10 1321665126  10 1321665130
2  1  10:12:06 00:00:10  Normal   2 10:12:15 1321665126  10 1321665135
7  2  10:12:30 00:00:30   Weird   3 10:12:45 1321665150  30 1321665165

OBS #2 looks to be in the range. Does that make sense?
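
The output above lists 10:12:15 for OBS 2, whereas the table in the question lists 10:12:17. A small sketch in base R can check both values against the first interval (start 10:12:06, duration 10 seconds). The helper names `at` and `in_range` are made up for this illustration, and the date is an arbitrary placeholder so that the times can be compared:

```r
# Hypothetical helper: attach an arbitrary fixed date to an "HH:MM:SS" string.
at <- function(x) as.POSIXct(paste("1970-01-01", x), tz = "UTC")

start <- at("10:12:06")
end   <- start + 10   # adding a number to POSIXct adds seconds

# Strict inequalities, matching the subset() condition used above.
in_range <- function(x) at(x) > start & at(x) < end

in_range("10:12:15")   # TRUE
in_range("10:12:17")   # FALSE
```

So whether OBS 2 is kept depends on which of the two times is in B: 10:12:15 lies inside the interval ending at 10:12:16, while 10:12:17 does not.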



Answer 3:

Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition Time >= StartTime & Time <= StartTime + Duration (after converting the time columns to something that supports arithmetic), or whatever rules make sense to you.
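
A minimal sketch of this merge-then-filter idea in base R, using the sample data from the question. The helper `to_seconds` is a hypothetical name introduced here, and the closed interval (both ends inclusive) is an assumption; adjust the comparison to whatever rule you need:

```r
# Sample data from the question, with times kept as character strings.
A <- data.frame(
  OBS = 1:4,
  ID = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS = 1:4,
  ID = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "HH:MM:SS" to seconds since midnight.
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

# Merge on ID (dropping A's OBS column so that B's OBS is the one kept).
m <- merge(B, A[, -1], by = "ID")

# Keep rows whose Time lies in [StartTime, StartTime + Duration].
start <- to_seconds(m$StartTime)
tm    <- to_seconds(m$Time)
keep  <- tm >= start & tm <= start + to_seconds(m$Duration)

result <- m[keep, c("OBS", "ID", "Time", "Outcome")]
result <- result[order(result$OBS), ]
rownames(result) <- NULL
print(result)
```

With the question's data this keeps OBS 1 (Normal) and OBS 3 (Weird). Note that merge() builds every ID-matched pair before filtering, so this approach may get memory-hungry when each ID has many rows in both tables.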



Tags: r time intervals