sum over past window-size dates per group

2020-02-14 10:20发布

问题:

The problem is similar to How do I do a conditional sum which only looks between certain date criteria but slightly different and the answer from that does not fit into current problem. The main difference is that the date column based on each group may not necessarily be complete (i.e., certain date may be missing)

Input:

input <- read.table(text="
2017-04-01     A     1
2017-04-02     B     2
2017-04-02     B     2
2017-04-02     C     2
2017-04-02     A     2
2017-04-03     C     3
2017-04-04     A     4
2017-04-05     B     5
2017-04-06     C     6
2017-04-07     A     7
2017-04-08     B     8
2017-04-09     C     9")
colnames(input) <- c("Date","Group","Score")

Rule: for each group at each date, looking back 3 calendar dates (include current date). calculate the sum.

Expected output:

    Date Group 3DaysSumPerGroup
    2017-04-01     A                1 #1  previous two dates are not available. partial is allowed
    2017-04-02     A                3 #2+1 both 4-01 and 4-02 are in the range
    2017-04-04     A                6 #4+2
    2017-04-07     A                7 #7
    2017-04-02     B                4 # 2+2 at the same day
    2017-04-05     B                5
    2017-04-08     B                8
    2017-04-02     C                2
    2017-04-03     C                5
    2017-04-06     C                6
    2017-04-09     C                9

I tried to use rollapply with partial=T, but result doesn't seem correct.

 input %>% 
     group_by(Group) %>% 
     arrange(Date) %>% mutate("3DaysSumPerGroup"=rollapply(data=Score,width=3,align="right",FUN=sum,partial=T,fill=NA,rm.na=T))

回答1:

Here's a (supposedly efficient) solution using the new non-equi joins and the by = .EACHI features in data.table (v1.9.8+)

library(data.table) #v1.10.4

## Convert to a proper date class, and add another column in order to define the range
setDT(input)[, c("Date", "Date2") := {
  Date = as.IDate(Date)
  Date2 = Date - 2L
  .(Date, Date2)
}]

## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column 

input[unique(input, by = c("Date", "Group")), ## This removes the dupes
      on = .(Group, Date <= Date, Date >= Date2), ## The join condition
      .(Score = sum(Score)), ## sum the scores
      keyby = .EACHI] ## Run the sum by each row in unique(input, by = c("Date", "Group"))

#     Group       Date       Date Score
#  1:     A 2017-04-01 2017-03-30     1
#  2:     A 2017-04-02 2017-03-31     3
#  3:     A 2017-04-04 2017-04-02     6
#  4:     A 2017-04-07 2017-04-05     7
#  5:     B 2017-04-02 2017-03-31     4
#  6:     B 2017-04-05 2017-04-03     5
#  7:     B 2017-04-08 2017-04-06     8
#  8:     C 2017-04-02 2017-03-31     2
#  9:     C 2017-04-03 2017-04-01     5
# 10:     C 2017-04-06 2017-04-04     6
# 11:     C 2017-04-09 2017-04-07     9