Test if date occurs in multiple date ranges with R

Posted 2019-04-01 10:45

Question:

I have a data frame with multiple date ranges (45 to be exact):

Range  Start       End
1      2014-01-01  2014-02-30
2      2015-01-10  2015-03-30
3      2016-04-20  2016-10-12
...    ...         ...

The ranges never overlap.

I also have a data frame with various event dates (200K+):

Event  Date
1      2014-01-02
2      2014-03-20
3      2015-04-01
4      2016-08-18
...    ...

I want to test if these dates fall within any of these ranges:

Event  Date        InRange
1      2014-01-02  TRUE
2      2014-03-20  FALSE
3      2015-04-01  FALSE
4      2016-08-18  TRUE
...

What is the best way to perform this test? I have looked at lubridate's between and interval functions as well as various Stack Overflow questions, but cannot find a good solution.

Answer 1:

You can build a vector containing every day covered by the ranges in your first data frame, then use the %in% operator to check whether each event date appears in it. Assuming the first data frame is called dateRange and the second events, this fits in one line:

events$InRange <- events$Date %in% unlist(Map(`:`, dateRange$Start, dateRange$End))

events
  Event       Date InRange
1     1 2014-01-02    TRUE
2     2 2014-03-20   FALSE
3     3 2015-04-01   FALSE
4     4 2016-08-18    TRUE

Here Map, combined with the : operator, creates a list with one sequence of days per range, running from Start to End. Symbolically (this is not valid R), the result looks something like list(2014-01-01 : 2014-02-30, 2015-01-10 : 2015-03-30, 2016-04-20 : 2016-10-12, ...). unlist then flattens that list into a single vector of days, which can be used conveniently with %in%.
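
A variant of the one-liner above, as a minimal sketch: it converts the dates to integer day counts before comparing, so the result does not depend on how %in% treats a Date vector against a plain integer vector. The sample data is rebuilt from the question, with the invalid 2014-02-30 replaced by 2014-02-28 as an assumption.

```r
# Sample data from the question; "2014-02-30" is not a real date,
# so 2014-02-28 is assumed here.
dateRange <- data.frame(
  Start = as.Date(c("2014-01-01", "2015-01-10", "2016-04-20")),
  End   = as.Date(c("2014-02-28", "2015-03-30", "2016-10-12"))
)
events <- data.frame(
  Event = 1:4,
  Date  = as.Date(c("2014-01-02", "2014-03-20", "2015-04-01", "2016-08-18"))
)

# Every day covered by any range, as integer days since 1970-01-01.
covered <- unlist(Map(`:`,
                      as.integer(dateRange$Start),
                      as.integer(dateRange$End)))

# Compare integers with integers; both Start and End count as in range.
events$InRange <- as.integer(events$Date) %in% covered
print(events)
```

The lookup vector holds one entry per covered day, so its size grows with the total length of the ranges rather than with the number of events.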



Answer 2:

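Provided the intervals in your first data frame are ordered and non-overlapping, you can test, for each event date, whether it falls at or after a $Start and before the corresponding $End. Using findInterval reduces the number of relational comparisons and the memory needed. Note that with this test a date exactly equal to an $End counts as outside the range.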

findInterval(events$Date, ranges$Start) > findInterval(events$Date, ranges$End)
#[1]  TRUE FALSE FALSE  TRUE

With this data (the invalid "2014-02-30" changed to 2014-02-28):

ranges = structure(list(Range = 1:3, Start = structure(c(16071, 16445, 
16911), class = "Date"), End = structure(c(16129, 16524, 17086
), class = "Date")), .Names = c("Range", "Start", "End"), row.names = c(NA, 
-3L), class = "data.frame")

events = structure(list(Event = 1:4, Date = structure(c(16072, 16149, 
16526, 17031), class = "Date")), .Names = c("Event", "Date"), row.names = c(NA, 
-4L), class = "data.frame")
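
To see why the comparison works, here is a small sketch that computes the two counts separately. findInterval(x, v) returns how many elements of the sorted vector v are less than or equal to x, so a date is inside a range when it has passed more starts than ends. Passing left.open = TRUE to the second call is one way to make the End day inclusive. The last event date below is hypothetical, added only to show the boundary case.

```r
ranges <- data.frame(
  Start = as.Date(c("2014-01-01", "2015-01-10", "2016-04-20")),
  End   = as.Date(c("2014-02-28", "2015-03-30", "2016-10-12"))
)
# The last date is a hypothetical event that falls exactly on an End.
dates <- as.Date(c("2014-01-02", "2014-03-20", "2015-04-01",
                   "2016-08-18", "2014-02-28"))

starts_passed <- findInterval(dates, ranges$Start)  # number of starts <= date
ends_passed   <- findInterval(dates, ranges$End)    # number of ends   <= date

# Half-open test [Start, End): the End day itself is excluded.
in_half_open <- starts_passed > ends_passed

# Closed test [Start, End]: count only the ends strictly before the date.
ends_before <- findInterval(dates, ranges$End, left.open = TRUE)
in_closed   <- starts_passed > ends_before

print(data.frame(dates, starts_passed, ends_passed, in_half_open, in_closed))
```

Both variants rely on Start and End being sorted and the ranges not overlapping, as stated in the question.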


Answer 3:

Write your own function that checks whether a single date falls within any of the intervals, then apply it to each date in turn.

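# TRUE if the date x lies within any [start, end] row of ranges.
# ISO "YYYY-MM-DD" strings sort chronologically, so plain string comparison works.
date.in <- function(x, ranges = df) {
  any(x >= ranges[[1]] & x <= ranges[[2]])
}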

Data:

df <- data.frame(start=c("2014-01-01", "2015-01-10", "2016-04-20"), 
       end=c("2014-02-30", "2015-03-30", "2016-10-12"))
df[] <- lapply(df, as.character)

s <- c("2014-01-02", "2014-03-20", "2015-04-01", "2016-08-18")

Test using the character vector s:

vapply(s, date.in, logical(1), USE.NAMES = FALSE)  # TRUE FALSE FALSE TRUE
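
The same idea can be written without relying on a global df. The sketch below is one possible variant, not the answer's original code: in_any_range is a hypothetical helper name, real Date objects replace the strings, and the invalid 2014-02-30 is assumed to mean 2014-02-28.

```r
# Hypothetical helper: TRUE for each date that lies in any [start, end] range.
in_any_range <- function(dates, start, end) {
  vapply(seq_along(dates),
         function(i) any(dates[i] >= start & dates[i] <= end),
         logical(1))
}

start <- as.Date(c("2014-01-01", "2015-01-10", "2016-04-20"))
end   <- as.Date(c("2014-02-28", "2015-03-30", "2016-10-12"))
s     <- as.Date(c("2014-01-02", "2014-03-20", "2015-04-01", "2016-08-18"))

result <- in_any_range(s, start, end)
print(result)
```

Each date is compared against all ranges at once, so the work grows with the number of events times the number of ranges.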