This is similar to this dplyr lag post, and this dplyr mutate lag post, but neither of those ask this question about defaulting to the input value. I am using dplyr to mutate a new field that's a lagged offset of another field (that I've converted to POSIXct). The goal is, for a given ip, I'd like to know some summary statistics on the delta between all the times it shows up on my list. I also have about 12 million rows.
The data look like this (prior to mutation)
ip hour snap
192.168.1.2 2017070700 0
192.168.1.2 2017070700 15
192.168.1.4 2017070700 0
192.168.1.4 2017070701 45
192.168.1.4 2017070702 30
192.168.1.7 2017070700 15
'hour' is an integer, but should be a timestamp.
'snap' is one of 4 'snapshot' values that represent 15 minute increments.
Here's the data.frame creation code:
test <- data.frame(ip=c("192.168.1.2","192.168.1.2","192.168.1.4","192.168.1.4","192.168.1.4","192.168.1.7"), hour=c(2017070700,2017070700,2017070700,2017070701,2017070702,2017070700), snap=c(0,15,0,45,30,15))
There are hundreds and sometimes thousands of timestamps per ip. The code below uses dplyr to
- a) pad the 0's with a leading 0,
- b) concat the two integer 'date' fields into one field,
- c) convert merged integer 'date' field to a POSIX date,
- d) group by ip,
- e) mutate a new column that's lagged of the old timestamp by 1 and, if the value is NA, refer back to the original value (THIS IS THE BIT THAT DOESN"T WORK), and
- f) mutate a new column that takes the difference of the current time and the previous time (by ip).
These steps refer to the comments at the end of each line.
timedelta <- test %>%
mutate(snap = formatC(snap, width=2, flag=0)) %>% # a)
mutate(fulldateint = paste(hour, snap, sep="")) %>% # b)
mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>% # c)
group_by(ip) %>% # d)
mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>% # e)
mutate(diff = fulldate-shifted) # f)
After mutation, the data should look like this:
ip hour snap fulldateint fulldate shifted diff
<fctr> <dbl> <chr> <chr> <dttm> <dttm> <time>
1 192.168.1.2 2017070700 00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00 0 secs
2 192.168.1.2 2017070700 15 201707070015 2017-07-07 00:15:00 2017-07-07 00:00:00 900 secs
3 192.168.1.4 2017070700 00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00 0 secs
4 192.168.1.4 2017070701 45 201707070145 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
5 192.168.1.4 2017070702 30 201707070230 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
6 192.168.1.7 2017070700 15 201707070015 2017-07-07 00:15:00 2017-07-07 00:15:00 0 secs
And if I could get lag to default to its original value, the 'delta-T' would always be 0 when it doesn't have a previous value (which is the desired result).
However, dplyr::lag(fulldate, default=fulldate)
throws the error
Error in mutate_impl(.data, dots) :
Column `shifted` must be length 2 (the group size) or one, not 3
It does work if I use fulldate1, but then I lose the group_by(ip)
result, which is necessary. Is it possible to make lag reference its own input within dplyr?
Note: I really would prefer an answer using dplyr and not data.table, if possible, since I've been using dplyr as our primary data munging library, but also since I'd like to suggest to Mr. Wickham that he take this under consideration if it truly has no solution in the existing dplyr library.