I'm trying to calculate a rolling count/sum of occurrences by group over the series of a time frame.
I have a data frame with some sample data like this:
dates = as.Date(c("2011-10-09",
"2011-10-15",
"2011-10-16",
"2011-10-18",
"2011-10-21",
"2011-10-22",
"2011-10-24"))
group1=c("A",
"C",
"A",
"A",
"L",
"F",
"A")
group2=c("D",
"A",
"B",
"H",
"A",
"A",
"E")
df1 <- data.frame(dates, group1, group2)
I iterate individual data frames for each unique 'group', so for example this is how the group for "A" would look (they are present in every row, whether in group1 or group2).
I want to count for "A" (and then each group later on) the number of event occurrences in a time range - the 'date' of the event (i.e., the present row date) and the previous 4 days. I want to roll that forward, so for example row 1 would have a count of 1, row 2 would also have a count of 1 (no events in the past 4 days aside from that present date), row 3 would have 2, row 4 would have 3 etc.
For each row, I'd like to end up with a column that basically says, on this event date, there are X number of events that have occurred on the present date (as indicated in the date column) and in the last 4 days.
I think, but am not sure, that you're looking for a way to count occurrences of each event type (letter) on each date (row) and the preceding four days, whether or not those preceding four days appear in your data. If that's right, then here's one approach using
dplyr
(for general convenience),tidyr
(to make the wide data long for easier counting by date), andzoo
(for itsrollapply
function).Result:
For this example, you can probably use
sapply
to analyze each row, counting the number of entries on that day or up to 4 days earlier, like so:Results in
df1
of:If, as your question implies, your data are from a larger set and you want to do the analysis on each group (conceptually, I think the question is: how many events have had this group in the last four days? asked only on days with an event from that group), you could follow the steps below.
First, here are some larger sample data with groups labelled as the first 10 letters of the alphabet:
Next, I extract all of the groups in the data (here, I know them, but for your real data, you may or may not have that list of groups already)
Then, I loop through that vector of group names and extract each of the events with that group as one of the two groups, adding the same column as above, and saving the separate data.frames in a list (and naming them to make it easier to access/track them).
returns a data.frame just like above for each of the groups in your data.
And, I couldn't help myself, so here is a
dplyr
andtidyr
solution as well. It is not much different than the list-based solution above, except that it returns everything in the same data.frame (which may or may not be a good thing, particularly as it will have two entries for each event this way).First, for simplicity, I defined a function to do the date checking. This could easily be used above as well.
Next, I am constructing a set of logical tests that will determine whether or not each of the groups is present. These will be used to generate columns for each group, giving TRUE/FALSE for present/absent in each event.
Finally, putting it altogether in one piped call. Instead of describing, I have commented each step.
The result is similar to the above
list
output, except with everything in one data.frame, which may allow for easier analysis of some features. The top looks like this:Note that my random sample had multiple events on Oct-05, leading to the large counts here.