R: Aggregating History By ID By Date

2019-06-13 01:39发布

I have a large data set that has unique IDs for individuals as well as dates, and each individual is capable of multiple encounters.

The below is code and an example of how this data might look:

strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",      
"6/8/16", "5/8/16","2/3/16","1/1/16")
Date<-as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A","A","B","B","B","B","B")
Event <- c(1,0,1,0,1,0,1,1,1,0)
sample_df <- data.frame(Date,ID,Event)

sample_df

         Date ID Event
1  2016-09-09  A     1
2  2016-06-07  A     0
3  2016-05-06  A     1
4  2016-02-03  A     0
5  2016-02-01  A     1
6  2016-11-08  B     0
7  2016-06-08  B     1
8  2016-05-08  B     1
9  2016-02-03  B     1
10 2016-01-01  B     0

I want to keep all attached information per encounter, but then aggregate the following historical information by id

  1. Number of Previous Encounters
  2. Number of Previous Events

As an example, let's look at Row 2.

Row 2 is ID A, so I would reference Rows 3-5 (which occurred prior to Row 2 Encounter). Within this group of rows, we see that Row 3 & 5 both had events.

Number of Previous Encounters for Row 2 = 3

Number of Previous Events for Row 2 = 2

Ideally, I would get the following output:

         Date ID Event PrevEnc PrevEvent
1  2016-09-09  A     1       4         2
2  2016-06-07  A     0       3         2
3  2016-05-06  A     1       2         1
4  2016-02-03  A     0       1         1
5  2016-02-01  A     1       0         0
6  2016-11-08  B     0       4         3
7  2016-06-08  B     1       3         2
8  2016-05-08  B     1       2         1
9  2016-02-03  B     1       1         0
10 2016-01-01  B     0       0         0

So far, I have tried working this problem in dplyr with mutate as well as summarise, both of which have not let me successfully restrict my aggregation to events that occurred previously for a specific ID. I have tried some messy For-loops with If-then statements, but really just wondering if a package or technique exists to simplify this process.

Thank you!

3条回答
老娘就宠你
2楼-- · 2019-06-13 01:56

Or, if you want to try with data.table, you can use this:

library(data.table)

# Convert to data.table and sort
sample_dt <- as.data.table(sample_df)
sample_dt <- sample_dt[order(Date)]

# Count only the previous Events with 1
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)), by = "ID"]

# .I gives the row number, and .SD contains the Subset of the Data for each group
sample_dt[, prevEnc := .SD[,.I - 1], by = "ID"]

print(sample_dt)
          Date ID Event prevEvent prevEnc
 1: 2016-01-01  B     0         0       0
 2: 2016-02-01  A     1         0       0
 3: 2016-02-03  A     0         1       1
 4: 2016-02-03  B     1         0       1
 5: 2016-05-06  A     1         1       2
 6: 2016-05-08  B     1         1       2
 7: 2016-06-07  A     0         2       3
 8: 2016-06-08  B     1         2       3
 9: 2016-09-09  A     1         2       4
10: 2016-11-08  B     0         3       4

If you didn't know this package, there is an excellent cheat sheet for most of the operations.

查看更多
Deceive 欺骗
3楼-- · 2019-06-13 02:00

The biggest impediment is the current sort order. Here, I stored an original index point, which I later used to re-sort the data (then removed it). Other than that, the basic idea is to count up from 0 for the encounters, and to use cumsum to count the events as they happen. To that end, lag is used to avoid counting the current event.

sample_df %>%
  mutate(origIndex = 1:n()) %>%
  group_by(ID) %>%
  arrange(ID, Date) %>%
  mutate(PrevEncounters = 0:(n() -1)
         , PrevEvents = cumsum(lag(Event, default = 0))) %>%
  arrange(origIndex) %>%
  select(-origIndex)

Gives

         Date     ID Event PrevEncounters PrevEvents
       <date> <fctr> <dbl>          <int>      <dbl>
1  2016-09-09      A     1              4          2
2  2016-06-07      A     0              3          2
3  2016-05-06      A     1              2          1
4  2016-02-03      A     0              1          1
5  2016-02-01      A     1              0          0
6  2016-11-08      B     0              4          3
7  2016-06-08      B     1              3          2
8  2016-05-08      B     1              2          1
9  2016-02-03      B     1              1          0
10 2016-01-01      B     0              0          0
查看更多
Explosion°爆炸
4楼-- · 2019-06-13 02:11

As @Frank and @MarkPeterson point out, the biggest hurdle here is that the Date column is sorted in descending order. Another approach that does not require resorting the Date column:

library(dplyr)
res <- sample_df %>% group_by(ID) %>% 
                     mutate(PrevEnc=n()-row_number(),
                            PrevEvent=rev(cumsum(lag(rev(Event), default=0))))

Here, we use row_number() to determine the row index and n() to determine the number of rows (grouped by ID). Since the Date is sorted in descending order, the number of previous encounter is simply n()-row_number(). To compute the number of previous events, we again exploit the fact that the Date column is sorted in descending order and use rev to reverse the order of the Event column before cumsum the lag of this reversed column. We then use rev again to reverse the result back to the original order.

Using your data:

print(res)
##Source: local data frame [10 x 5]
##Groups: ID [2]
##
##         Date     ID Event PrevEnc PrevEvent
##       <date> <fctr> <dbl>   <int>     <dbl>
##1  2016-09-09      A     1       4         2
##2  2016-06-07      A     0       3         2
##3  2016-05-06      A     1       2         1
##4  2016-02-03      A     0       1         1
##5  2016-02-01      A     1       0         0
##6  2016-11-08      B     0       4         3
##7  2016-06-08      B     1       3         2
##8  2016-05-08      B     1       2         1
##9  2016-02-03      B     1       1         0
##10 2016-01-01      B     0       0         0
查看更多
登录 后发表回答