Note: Similar question I have asked for SQL - How to use a window function to determine when to perform different tasks in Hive or Postgres?
Data
I have a some data showing the start day and end day for different pre-prioritised tasks per person:
input_df <- data.frame(person = c(rep("Kate", 2), rep("Adam", 2), rep("Eve", 2), rep("Jason", 5)),
task_key = c(c("A","B"), c("A","B"), c("A","B"), c("A","B","C","D","E")),
start_day = c(c(1L,1L), c(1L,2L), c(2L,1L), c(1L,4L,3L,5L,4L)),
end_day = 5L)
person task_key start_day end_day 1 Kate A 1 5 2 Kate B 1 5 3 Adam A 1 5 4 Adam B 2 5 5 Eve A 2 5 6 Eve B 1 5 7 Jason A 1 5 8 Jason B 4 5 9 Jason C 3 5 10 Jason D 5 5 11 Jason E 4 5
NOTE: Task key is ordered so that higher letters have higher priorities.
Question
I need to work out which task each person should be working on each day, with the condition that:
- Higher lettered tasks take priority over lower lettered tasks.
- If a higher lettered task overlaps any part of a lower lettered task, then the lower lettered task gets set to NA (to represent that the person should not work on it ever).
Simplification
In the real data the end_day is always 5 in the original table i.e. only the start_day varies but the end_day is constant. This means my desired output will have the same number of rows as my original table :)
Output
This is the sort of output I need (Jason is more representative of the data I have which can be over 100 tasks covering a period of 90 days):
output_df <- data.frame(person = c(rep("Kate", 2), rep("Adam", 2), rep("Eve", 2), rep("Jason", 5)),
task_key = c(c("A","B"), c("A","B"), c("A","B"), c("A","B","C","D","E")),
start_day = c(c(1L,1L), c(1L,2L), c(2L,1L), c(1L,4L,3L,5L,4L)),
end_day = 5L,
valid_from = c( c(NA,1L), c(1L,2L), c(NA,1L), c(1L,NA,3L,NA,4L) ),
valid_to = c( c(NA,5L), c(2L,5L), c(NA,5L), c(3L,NA,4L,NA,5L) ))
person task_key start_day end_day valid_from valid_to 1 Kate A 1 5 NA NA 2 Kate B 1 5 1 5 3 Adam A 1 5 1 2 4 Adam B 2 5 2 5 5 Eve A 2 5 NA NA 6 Eve B 1 5 1 5 7 Jason A 1 5 1 3 8 Jason B 4 5 NA NA 9 Jason C 3 5 3 4 10 Jason D 5 5 NA NA 11 Jason E 4 5 4 5
Initial Thoughts
Works but I want a solution that works using the dbplyr package functions and something that is generally better than this:
tmp <- input_df %>% filter(person == "Jason")
num_rows <- nrow(tmp)
tmp$valid_from <- NA
tmp$valid_to <- NA
for(i in 1:num_rows) {
# Curent value
current_value <- tmp$start_day[i]
# Values to test against
vec <- lead(tmp$start, i)
# test
test <- current_value >= vec
# result
if(any(test, na.rm = TRUE) & i!=num_rows) {
tmp$valid_from[i] <- NA
tmp$valid_to[i] <- NA
} else if(i!=num_rows) {
tmp$valid_from[i] <- current_value
tmp$valid_to[i] <- min(vec, na.rm = TRUE)
} else {
tmp$valid_from[i] <- current_value
tmp$valid_to[i] <- max(tmp$end_day, na.rm = TRUE)
}
}
tmp
person task_number start_day end_day valid_from valid_to 1 Jason A 1 5 1 3 2 Jason B 4 5 NA NA 3 Jason C 3 5 3 4 4 Jason D 5 5 NA NA 5 Jason E 4 5 4 5
Follow up question
Eventually I'll need to do this in SQL but that seems too hard. I heard that the 'dbply' package could help me here because if I can solve this using the dplyr functions then it will somehow convert that to a valid SQL query?
A solution using the tidyverse package.
map2
andunnest
are to expand the dataset.arrange(person, desc(task_key))
anddistinct(person, Days, .keep_all = TRUE)
are to remove duplicates based on the order oftask_key
. After that, we can useslice
to select the last row and manipulate the start and end dates.Interestingly, I had to do something similar earlier in the week but in a different context.
A solution using just the
dplyr
package is presented below (there is a warning at step 10 but I think it can be ignored).In terms of converting this
dplyr
solution into adbplyr
solution with associated valid SQL code is something I don't know how to do (I gave it a go but it didn't work).EDIT: In the original version of your question you had numbers instead of letters for your task key which is what I used. I didn't see you had edited your question until after I posted :)
Code with comments:
Output: