I am looking to look up the individual id
in events_table and calculate the total_duration
as the sum of the duration of all events prior to date
.
The duration is the time between the date_start
and date
(table1), unless the event ended (i.e. has a date_end
), in which case if date_end < date
, duration = date_end - date_start
.
In pseudo code:
IF (date>date_start) Then{
IF(date_end < date & date_end != NA) Then{
duration = date_end-date_start
} else if (date_start < date) {
duration = date - date_start
}
}
Then sum all the durations separately for each "individual_id" and "date" combo
I am using data.tables as I have large tables (>1m rows).
My data looks a bit like this:
table1 <- fread(
"individual id | date
1 | 2019-01-02
1 | 2019-01-03
2 | 2019-01-02
2 | 2019-01-03",
sep ="|"
)
events_table<- fread(
"individual id | date_start | date_end
1 | 2018-01-02 | NA
1 | 2018-01-04 | 2018-07-01
1 | 2018-01-05 | NA
2 | 2018-01-01 | NA
2 | 2018-01-02 | NA
2 | 2018-01-05 | 2018-11-21",
sep = "|"
)
The output should be the following:
table1 <- fread(
"individual id | date | total_duration
1 | 2019-01-02 | 905
1 | 2019-01-03 | 907
2 | 2019-01-02 | 1051
2 | 2019-01-03 | 1053",
sep ="|"
)
My best guess at starting the query comes from:
table1[, total_duration:= events_table[table1,
on = .(`individual id`, date>date_start),
sum(date-date_start),
by = .EACHI][["V1"]]]
But I dont know the syntax for including the if condition.
Thanks for any help.
You don't have to define three distinct columns, though it is easier for debugging. Instead, you could initialize
table1[, v := 0]
and for each step dotable1[, v := v + ...]
.