Rolling Count of Events Over Time Series

2019-07-02 06:25发布

I'm trying to calculate a rolling count/sum of occurrences by group over the series of a time frame.

I have a data frame with some sample data like this:

dates = as.Date(c("2011-10-09",
        "2011-10-15",
        "2011-10-16", 
        "2011-10-18", 
        "2011-10-21", 
        "2011-10-22", 
        "2011-10-24"))

group1=c("A",
         "C",
         "A", 
         "A", 
         "L", 
         "F", 
         "A")
group2=c("D",
         "A",
         "B", 
         "H", 
         "A", 
         "A", 
         "E")

df1 <- data.frame(dates, group1, group2)

I iterate individual data frames for each unique 'group', so for example this is how the group for "A" would look (they are present in every row, whether in group1 or group2).

I want to count for "A" (and then each group later on) the number of event occurrences in a time range - the 'date' of the event (i.e., the present row date) and the previous 4 days. I want to roll that forward, so for example row 1 would have a count of 1, row 2 would also have a count of 1 (no events in the past 4 days aside from that present date), row 3 would have 2, row 4 would have 3 etc.

For each row, I'd like to end up with a column that basically says, on this event date, there are X number of events that have occurred on the present date (as indicated in the date column) and in the last 4 days.

2条回答
等我变得足够好
2楼-- · 2019-07-02 06:48

I think, but am not sure, that you're looking for a way to count occurrences of each event type (letter) on each date (row) and the preceding four days, whether or not those preceding four days appear in your data. If that's right, then here's one approach using dplyr (for general convenience), tidyr (to make the wide data long for easier counting by date), and zoo (for its rollapply function).

library(dplyr)
library(tidyr)
library(zoo)

df2 <- df1 %>%
  # make the wide data long so we can group and then count by date
  gather(key = group, value = event, group1:group2) %>%
  # group by date
  group_by(dates) %>%
  # count occurrences of the event of interest on each date
  summarise(sum.a = sum(event == "A")) %>%
  # join that set of counts to a complete date sequence
  left_join(data.frame(dates = seq(first(dates), last(dates), by = "day")), .) %>%
  # use rollapply to get sums of those counts across rolling windows that
  # are 4 days wide and right-aligned
  mutate(sum.a = rollapply(sum.a, width = 4, sum, na.rm = TRUE,
                               partial = TRUE, align = "right")) %>%
  # filter back to the original set of dates in df1
  filter(dates %in% df1$dates)

Result:

> df2
       dates sum.a
1 2011-10-09     1
2 2011-10-15     1
3 2011-10-16     2
4 2011-10-18     3
5 2011-10-21     2
6 2011-10-22     2
7 2011-10-24     3
查看更多
Anthone
3楼-- · 2019-07-02 06:49

For this example, you can probably use sapply to analyze each row, counting the number of entries on that day or up to 4 days earlier, like so:

df1$lastFour <-
  sapply(df1$dates, function(x){
    sum(df1$dates <= x & df1$dates >= x - 4)
  })

Results in df1 of:

       dates group1 group2 lastFour
1 2011-10-09      A      D        1
2 2011-10-15      C      A        1
3 2011-10-16      A      B        2
4 2011-10-18      A      H        3
5 2011-10-21      L      A        2
6 2011-10-22      F      A        3
7 2011-10-24      A      E        3

If, as your question implies, your data are from a larger set and you want to do the analysis on each group (conceptually, I think the question is: how many events have had this group in the last four days? asked only on days with an event from that group), you could follow the steps below.

First, here are some larger sample data with groups labelled as the first 10 letters of the alphabet:

biggerData <-
  data.frame(
    dates = sample(seq(as.Date("2011-10-01")
                       , as.Date("2011-10-31")
                       , 1)
                   , 100, TRUE)
    , group1 = sample(LETTERS[1:10], 100, TRUE)
    , group2 = sample(LETTERS[1:10], 100, TRUE)
  )

Next, I extract all of the groups in the data (here, I know them, but for your real data, you may or may not have that list of groups already)

groupsInData <-
  sort(unique(c(as.character(biggerData$group1)
                , as.character(biggerData$group2))))

Then, I loop through that vector of group names and extract each of the events with that group as one of the two groups, adding the same column as above, and saving the separate data.frames in a list (and naming them to make it easier to access/track them).

sepGroupCounts <- lapply(groupsInData, function(thisGroup){
  dfTemp <- biggerData[biggerData$group1 == thisGroup | 
                         biggerData$group2 == thisGroup, ]

  dfTemp$lastFour <-
    sapply(dfTemp$dates, function(x){
      sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
    })
  return(dfTemp)

}) 

names(sepGroupCounts) <- groupsInData

returns a data.frame just like above for each of the groups in your data.

And, I couldn't help myself, so here is a dplyr and tidyr solution as well. It is not much different than the list-based solution above, except that it returns everything in the same data.frame (which may or may not be a good thing, particularly as it will have two entries for each event this way).

First, for simplicity, I defined a function to do the date checking. This could easily be used above as well.

myDateCheckFunction <- function(x){
  sapply(x, function(thisX){
    sum(x <= thisX & x >= thisX - 4 )
  })
}

Next, I am constructing a set of logical tests that will determine whether or not each of the groups is present. These will be used to generate columns for each group, giving TRUE/FALSE for present/absent in each event.

dotsConstruct <-
  paste0("group1 == '", groupsInData, "' | "
         , "group2 == '", groupsInData, "'") %>%
  setNames(groupsInData)

Finally, putting it altogether in one piped call. Instead of describing, I have commented each step.

withLastFour <-
  # Start with data
  biggerData %>%
  # Add a col for each group using Standard Evaluation
  mutate_(.dots = dotsConstruct) %>%
  # convert to long form; one row per group per event
  gather(GroupAnalyzed, Present, -dates, -group1, -group2) %>%
  # Limit to only rows where the `GroupAnalyzed` is present
  filter(Present) %>%
  # Remove the `Present` column, as it is now all "TRUE"
  select(-Present) %>%
  # Group by the groups we are analyzing
  group_by(GroupAnalyzed) %>%
  # Add the column for count in the last four dates
  # `group_by` limits this to just counts within that group
  mutate(lastFour = myDateCheckFunction(dates)) %>%
  # Sort by group and date for prettier checking
  arrange(GroupAnalyzed, dates)

The result is similar to the above list output, except with everything in one data.frame, which may allow for easier analysis of some features. The top looks like this:

       dates group1 group2 GroupAnalyzed lastFour
      <date> <fctr> <fctr>         <chr>    <int>
1 2011-10-01      B      A             A        1
2 2011-10-02      J      A             A        2
3 2011-10-05      C      A             A        5
4 2011-10-05      C      A             A        5
5 2011-10-05      G      A             A        5
6 2011-10-08      E      A             A        5

Note that my random sample had multiple events on Oct-05, leading to the large counts here.

查看更多
登录 后发表回答