I have a large data set with unique IDs for individuals as well as dates, and each individual can have multiple encounters.
The code below builds an example of how this data might look:
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",
"6/8/16", "5/8/16","2/3/16","1/1/16")
Date<-as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A","A","B","B","B","B","B")
Event <- c(1,0,1,0,1,0,1,1,1,0)
sample_df <- data.frame(Date,ID,Event)
sample_df
Date ID Event
1 2016-09-09 A 1
2 2016-06-07 A 0
3 2016-05-06 A 1
4 2016-02-03 A 0
5 2016-02-01 A 1
6 2016-11-08 B 0
7 2016-06-08 B 1
8 2016-05-08 B 1
9 2016-02-03 B 1
10 2016-01-01 B 0
I want to keep all attached information per encounter, but also aggregate the following historical information by ID:
- Number of Previous Encounters
- Number of Previous Events
As an example, let's look at Row 2.
Row 2 is ID A, so I would reference Rows 3-5 (which occurred prior to the Row 2 encounter). Within this group of rows, we see that Rows 3 and 5 both had events.
Number of Previous Encounters for Row 2 = 3
Number of Previous Events for Row 2 = 2
Ideally, I would get the following output:
Date ID Event PrevEnc PrevEvent
1 2016-09-09 A 1 4 2
2 2016-06-07 A 0 3 2
3 2016-05-06 A 1 2 1
4 2016-02-03 A 0 1 1
5 2016-02-01 A 1 0 0
6 2016-11-08 B 0 4 3
7 2016-06-08 B 1 3 2
8 2016-05-08 B 1 2 1
9 2016-02-03 B 1 1 0
10 2016-01-01 B 0 0 0
So far, I have tried working this problem in dplyr with mutate as well as summarise, but neither has let me successfully restrict my aggregation to events that occurred previously for a specific ID. I have also tried some messy for-loops with if-then statements, but I am really just wondering whether a package or technique exists to simplify this process.
Thank you!
Or, if you want to try with data.table, you can solve it with that as well. If you didn't know this package, there is an excellent cheat sheet for most of the operations.
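A minimal sketch of what such a data.table solution might look like, using the grouped := (update by reference) idiom; the column names PrevEnc and PrevEvent are taken from the desired output in the question, not from this answer's original code:

```r
library(data.table)

# Rebuild the question's sample data
sample_df <- data.frame(
  Date = as.Date(c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
                   "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16"),
                 "%m/%d/%y"),
  ID = rep(c("A", "B"), each = 5),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

dt <- as.data.table(sample_df)

# Within each ID, walk the rows in ascending Date order:
# PrevEnc counts up from 0; PrevEvent is the running sum of past
# events, with shift() keeping the current row's event out of the sum.
# := assigns back to the original row positions, so no re-sorting is needed.
dt[order(Date),
   `:=`(PrevEnc   = seq_len(.N) - 1L,
        PrevEvent = cumsum(shift(Event, fill = 0))),
   by = ID]
dt
```

Because := modifies the table by reference, the rows stay in their original (descending-Date) order and the two new columns are simply attached.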
The biggest impediment is the current sort order. Here, I stored an original index point, which I later used to re-sort the data (then removed it). Other than that, the basic idea is to count up from 0 for the encounters, and to use cumsum to count the events as they happen. To that end, lag is used to avoid counting the current event.
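The steps just described can be sketched in dplyr as follows; the helper column name origIndex is my own, not from the original answer:

```r
library(dplyr)

# Rebuild the question's sample data
sample_df <- data.frame(
  Date = as.Date(c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
                   "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16"),
                 "%m/%d/%y"),
  ID = rep(c("A", "B"), each = 5),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

res <- sample_df %>%
  mutate(origIndex = row_number()) %>%   # remember the original row order
  arrange(ID, Date) %>%                  # oldest encounter first within each ID
  group_by(ID) %>%
  mutate(PrevEnc   = row_number() - 1,   # count encounters up from 0
         # running total of past events; lag() keeps the current event out
         PrevEvent = cumsum(lag(Event, default = 0))) %>%
  ungroup() %>%
  arrange(origIndex) %>%                 # re-sort back to the original order
  select(-origIndex)                     # then remove the index column

res
```

This reproduces the desired output from the question, with the rows back in their original descending-Date order.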
As @Frank and @MarkPeterson point out, the biggest hurdle here is that the Date column is sorted in descending order. Here is another approach that does not require re-sorting the Date column. We use row_number() to determine the row index and n() to determine the number of rows (grouped by ID). Since Date is sorted in descending order, the number of previous encounters is simply n() - row_number(). To compute the number of previous events, we again exploit the fact that the Date column is sorted in descending order: we use rev to reverse the order of the Event column, take the cumsum of the lag of this reversed column, and then use rev again to reverse the result back to the original order.