I have a dataframe df1 with an ID column and a lubridate time interval column, and I want to filter (subsample) a dataframe df2, which has ID and DateTime columns, so that only df2 rows with DateTime fitting the corresponding ID interval in df1 are kept. I want to do so in a tidyverse framework.
It can easily be done using a join (see example below), but I would like to know whether there would be a more direct solution (maybe purrr-based) that would avoid joining and then removing the time-interval data from the second dataframe. Thanks.
The question posted here Merge two dataframes if timestamp of x is within time interval of y is close to the one asked here but proposed solution were similar to the one I developed and not in a tidyverse framework.
A minimal code to show the problem and my current solution:
library(tibble)
library(lubridate)
df1 <- tribble(
~ID, ~Date1, ~Date2,
"ID1", "2018-04-16", "2018-06-14",
"ID2", "2018-04-20", "2018-06-25")
df1 <- mutate(df1,Interval = interval(ymd(Date1),ymd(Date2)))
df2 <- tribble(
~ID, ~DateTime,
"ID1", "2018-04-12",
"ID1", "2018-05-05",
"ID2", "2018-04-23",
"ID2", "2018-07-12")
df2 <- mutate(df2,DateTime=ymd(DateTime))
df1 looks like this
> df1
# A tibble: 2 x 4
ID Date1 Date2 Interval
<chr> <chr> <chr> <S4: Interval>
1 ID1 2018-04-16 2018-06-14 2018-04-16 UTC--2018-06-14 UTC
2 ID2 2018-04-20 2018-06-25 2018-04-20 UTC--2018-06-25 UTC
and df2 like this:
> df2
# A tibble: 4 x 2
ID DateTime
<chr> <date>
1 ID1 2018-04-12
2 ID1 2018-05-05
3 ID2 2018-04-23
4 ID2 2018-07-12
In df2, the second record for ID1 is not within the ID1 interval in df1. The second record for ID2 is also not within the ID2 interval in df1.
My current solution based on joining and the removing the joined column follows:
df_out <- df2 %>%
left_join(.,df1,by="ID") %>%
filter(.,DateTime %within% Interval) %>%
select(.,-Interval)
> df_out
# A tibble: 2 x 4
ID DateTime Date1 Date2
<chr> <date> <chr> <chr>
1 ID1 2018-05-05 2018-04-16 2018-06-14
2 ID2 2018-04-23 2018-04-20 2018-06-25
I have the feeling a tidyverse alternative that would avoid joining and then removing the Interval column should exist.
There is a package called
fuzzyjoin
that can do asemi_join
based on an interval. Semi join means that it filters the "left" dataframe depending on match to the "right" dataframe. Try:Which gives the result: