How to perform a join based on intervals with dply

2019-08-18 00:21发布

问题:

I have a data frame containing two columns: a grouping variable and a interval period over which the grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables together somewhat efficiently with dplyr+tidyverse functions?

library(dplyr)
library(lubridate)
ty <- data_frame(date = mdy(paste(1, 1 + seq(20), 2017, sep = "/")), 
                 y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
gy <- data_frame(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")), 
                                   mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))), 
                          batch = c(1, 2, 3))

I want to build the table that is equivalent to:

ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))

Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).

回答1:

How about:

ty %>% 
  mutate(batch = case_when(
  ty$date %within% gy$period[1] ~gy$batch[1],
  ty$date %within% gy$period[2] ~gy$batch[2],
  ty$date %within% gy$period[3] ~gy$batch[3]))

You would obviously need to define the case_when intervals. How many have you got? I've used cat and paste0 with good effect for that in the past.

Edited to reflect OP's comments. This should take care of the NSE and would allow the generation of the case_when intervals programatically:

ty %>%
  mutate(batch = eval(parse(text = paste0("case_when(",
                                      paste(
                                        paste0(
                                          "ty$date %within% gy$period[",
                                          seq_along(gy$period),
                                          "] ~gy$batch[",
                                          seq_along(gy$period),
                                          "]"
                                        ),
                                        collapse = ", "
                                      ), ")"))))


回答2:

This is the best I could come up with so far:

ty$batch <- unlist(lapply(ty$date, function(d) gy$batch[which(d %within% gy$period)]), recursive = FALSE, use.names = FALSE)

But it doesn't seem very fast.