Calculate elapsed time since last event

Posted 2019-02-10 23:02

I have a data frame that contains multiple subjects (id), with repeated observations (recorded at times time). Each of the times may or may not be associated with an event (event). An example data frame can be generated with:

set.seed(12345)
id <- c(rep(1, 9), rep(2, 9), rep(3, 9))
time <- c(seq(from = 0, to = 96, by = 12),
      seq(from = 0, to = 80, by = 10),
      seq(from = 0, to = 112, by = 14))
random <- runif(n = 27)
event <- rep(100, 27)

df <- data.frame(cbind(id, time, event, random))
df$event <- ifelse(df$random < 0.55, 0, df$event)
df <- subset(df, select = -c(random))
df$event <- ifelse(df$time == 0, 100, df$event)

I would like to calculate the time between events (tae [time after the last event]), such that the ideal output would look like:

head(ideal_df)
  id time event tae
1  1    0   100   0
2  1   12   100   0
3  1   24   100   0
4  1   36   100   0
5  1   48     0  12
6  1   60     0  24

In Fortran, I use the following code to create the tae variable:

IF(EVENT.GT.0) THEN
  TEVENT = TIME
  TAE = 0
ENDIF

IF(EVENT.EQ.0) THEN
  TAE = TIME - TEVENT
ENDIF

In R, I have attempted both an ifelse and a dplyr solution; however, neither produces my desired output.

# Calculate the time since last event (using ifelse)
df$tae <- ifelse(df$event >= 0, df$tevent = df$time & df$tae = 0, df$tae = df$time - df$tevent)

Error: unexpected '=' in "df$tae <- ifelse(df$event >= 0, df$tevent ="

# Calculate the time since last event (using dplyr)
res <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(tae = time - lag(time))
res 

   id time event tae
1   1    0   100  NA
2   1   12   100  12
3   1   24   100  12
4   1   36   100  12
5   1   48     0  12
6   1   60     0  12

Clearly, neither of these yields my desired output. It appears that assignment inside the ifelse() call is simply not valid R syntax. My attempt at a dplyr solution also fails to account for the event variable...

Lastly, I will also need another variable that records the time until the next event (tue). If anyone has a thought on how best to approach this (perhaps trickier) calculation, please feel free to share it.

Any thoughts regarding how to get one of these working (or an alternative solution) would be greatly appreciated. Thanks!

P.S. -- A reproducible example in which the interval between events changes within an id is presented below:

id <- rep(1, 9)
time <- c(0, 10, 22, 33, 45, 57, 66, 79, 92)
event <- c(100, 0, 0, 100, 0, 100, 0, 0, 100)
df <- data.frame(cbind(id, time, event))

head(df)
  id time event
1  1    0   100
2  1   10     0
3  1   22     0
4  1   33   100
5  1   45     0
6  1   57   100

4 Answers

啃猪蹄的小仙女
#2 · 2019-02-10 23:39

The compactness of dplyr is appealing, but all the unnecessary intermediate calculations really hurt its run time:

library(dplyr)
library(microbenchmark)

loopfun <- function(df){

  event <- (df$event == 100)
  lasttime <- 0

  time <- df$time
  tae <- rep(0, nrow(df))

  for(i in 1:nrow(df)){

    if(event[i]){

      lasttime <- time[i]

    }else{

      tae[i] <- time[i] - lasttime

    }

  }

  df$tae <- tae

  return(df)
}

dplyrfun <- function(df){

  return(df %>%
    mutate(tmp = c(0, diff(time)) * !event,
           tmp2 = cumsum(c(FALSE, as.logical(diff(event))))) %>%
    group_by(tmp2) %>%
    mutate(tae = cumsum(tmp)) %>%
    select(-tmp, -tmp2)
  )
}

microbenchmark(loopfun(df), dplyrfun(df), times = 10000)

Unit: microseconds
         expr      min       lq       mean   median       uq      max neval
  loopfun(df)   57.356   70.035   95.89365   82.109   96.599 49001.19 10000
 dplyrfun(df) 1494.564 1625.274 1875.85263 1705.722 1877.336 50087.32 10000
叼着烟拽天下
#3 · 2019-02-10 23:41

Here's an approach with dplyr:

library(dplyr)
df %>%
  mutate(tmpG = cumsum(c(FALSE, as.logical(diff(event))))) %>%
  group_by(id) %>%
  mutate(tmp_a = c(0, diff(time)) * !event,
         tmp_b = c(diff(time), 0) * !event) %>%
  group_by(tmpG) %>%
  mutate(tae = cumsum(tmp_a),
         tbe = rev(cumsum(rev(tmp_b)))) %>%
  ungroup() %>%
  select(-c(tmp_a, tmp_b, tmpG))

The new columns are the time after the last event (tae) and the time before the next event (tbe).
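To see what the temporary grouping column does, here is the same cumsum/diff trick applied to just the event values of id == 1 from the first example (event_1 below is simply that slice of the data); a new group index starts each time event changes value, so the later cumsum only accumulates within a single run of non-event rows:

event_1 <- c(100, 100, 100, 100, 0, 0, 0, 0, 100)
cumsum(c(FALSE, as.logical(diff(event_1))))
# [1] 0 0 0 0 1 1 1 1 2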

The result:

   id time event tae tbe
1   1    0   100   0   0
2   1   12   100   0   0
3   1   24   100   0   0
4   1   36   100   0   0
5   1   48     0  12  48
6   1   60     0  24  36
7   1   72     0  36  24
8   1   84     0  48  12
9   1   96   100   0   0
10  2    0   100   0   0
11  2   12     0  12  24
12  2   24     0  24  12
13  2   36   100   0   0
14  2   48     0  12  48
15  2   60     0  24  36
16  2   72     0  36  24
17  2   84     0  48  12
18  2   96     0  60   0
19  3    0   100   0   0
20  3   12   100   0   0
21  3   24     0  12  24
22  3   36     0  24  12
23  3   48   100   0   0
24  3   60   100   0   0
25  3   72   100   0   0
26  3   84     0  12  12
27  3   96   100   0   0

The result with the second example:

  id time event tae tbe
1  1    0   100   0   0
2  1   10     0  10  23
3  1   22     0  22  11
4  1   33   100   0   0
5  1   45     0  12  12
6  1   57   100   0   0
7  1   66     0   9  26
8  1   79     0  22  13
9  1   92   100   0   0
我命由我不由天
#4 · 2019-02-10 23:41

You were very close with your dplyr implementation. Try this

df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(tae = cumsum(event==0)*12)
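This reproduces the head() of the ideal output above, but multiplying by 12 relies on the observations being spaced exactly 12 time units apart. For irregular spacing (as in the second example in the question), a variant that instead tracks the time of the most recent event might look like the following sketch, assuming (as in the example data) that each id starts with an event at time 0:

library(dplyr)

df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(tevent = cummax(ifelse(event > 0, time, 0)),  # running time of the most recent event
         tae    = time - tevent) %>%                    # 0 on event rows, elapsed time otherwise
  ungroup() %>%
  select(-tevent)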
\"骚年 ilove
5楼-- · 2019-02-10 23:41

I can't think of a way to vectorize it right now, but here's a loop that should be decently quick (O(n)).

event <- (df$event == 100)
lasttime <- 0

time <- df$time
tae <- rep(0, nrow(df))

for(i in 1:nrow(df)){

    if(event[i]){

        lasttime <- time[i]

    }else{

        tae[i] <- time[i] - lasttime

    }

}

df$tae <- tae
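Note that lasttime is never reset at id boundaries; the loop works on the example data only because every id starts with an event at time 0. If that might not hold, one option is to wrap the loop in a helper and apply it to each subject separately. A sketch (tae_one_id is just an illustrative name):

# Compute tae for the rows of a single subject
tae_one_id <- function(d) {
  lasttime <- 0
  tae <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    if (d$event[i] == 100) {
      lasttime <- d$time[i]          # remember when the last event occurred
    } else {
      tae[i] <- d$time[i] - lasttime
    }
  }
  d$tae <- tae
  d
}

# Apply the helper per id and recombine the pieces
df_tae <- do.call(rbind, lapply(split(df, df$id), tae_one_id))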