How to most efficiently filter a dataframe conditi

I have a dataframe df1 with an ID column and a lubridate time interval column, and I want to filter (subsample) a dataframe df2, which has ID and DateTime columns, so that only df2 rows with DateTime fitting the corresponding ID interval in df1 are kept. I want to do so in a tidyverse framework.
It can easily be done using a join (see example below), but I would like to know whether there would be a more direct solution (maybe purrr-based) that would avoid joining and then removing the time-interval data from the second dataframe. Thanks.

The question posted here Merge two dataframes if timestamp of x is within time interval of y is close to the one asked here but proposed solution were similar to the one I developed and not in a tidyverse framework.

A minimal code to show the problem and my current solution:

library(tibble)  
library(lubridate)

df1 <- tribble(
  ~ID, ~Date1, ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25") 
df1 <- mutate(df1,Interval = interval(ymd(Date1),ymd(Date2)))

df2 <- tribble(
  ~ID, ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12")
df2 <- mutate(df2,DateTime=ymd(DateTime))

df1 looks like this

> df1
# A tibble: 2 x 4
  ID    Date1      Date2      Interval                      
  <chr> <chr>      <chr>      <S4: Interval>                
1 ID1   2018-04-16 2018-06-14 2018-04-16 UTC--2018-06-14 UTC
2 ID2   2018-04-20 2018-06-25 2018-04-20 UTC--2018-06-25 UTC

and df2 like this:

> df2
# A tibble: 4 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-04-12
2 ID1   2018-05-05
3 ID2   2018-04-23
4 ID2   2018-07-12

In df2, the second record for ID1 is not within the ID1 interval in df1. The second record for ID2 is also not within the ID2 interval in df1.

My current solution based on joining and the removing the joined column follows:

df_out <- df2 %>%
  left_join(.,df1,by="ID") %>%
  filter(.,DateTime %within% Interval) %>%
  select(.,-Interval)

> df_out
# A tibble: 2 x 4
  ID    DateTime   Date1      Date2     
  <chr> <date>     <chr>      <chr>     
1 ID1   2018-05-05 2018-04-16 2018-06-14
2 ID2   2018-04-23 2018-04-20 2018-06-25

I have the feeling a tidyverse alternative that would avoid joining and then removing the Interval column should exist.

标签： r tidyverse purrr

1条回答

男人必须洒脱

2楼-- · 2019-08-30 04:12

There is a package called fuzzyjoin that can do a semi_join based on an interval. Semi join means that it filters the "left" dataframe depending on match to the "right" dataframe. Try:

library(fuzzyjoin)
df2 %>% 
  fuzzy_semi_join(df1, 
                  by=c("DateTime"="Date1", "DateTime"="Date2"),
                  match_fun=list(`>=`, `<=`))

Which gives the result:

# A tibble: 2 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-05-05
2 ID2   2018-04-23

0人赞添加讨论(0) 举报

How to most efficiently filter a dataframe conditi

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间