Can you make dplyr::mutate and dplyr::lag default

2020-03-24 08:48发布

问题:

This is similar to this dplyr lag post, and this dplyr mutate lag post, but neither of those ask this question about defaulting to the input value. I am using dplyr to mutate a new field that's a lagged offset of another field (that I've converted to POSIXct). The goal is, for a given ip, I'd like to know some summary statistics on the delta between all the times it shows up on my list. I also have about 12 million rows.

The data look like this (prior to mutation)

ip             hour         snap
192.168.1.2    2017070700    0
192.168.1.2    2017070700   15
192.168.1.4    2017070700    0
192.168.1.4    2017070701   45
192.168.1.4    2017070702   30
192.168.1.7    2017070700   15

'hour' is an integer, but should be a timestamp.

'snap' is one of 4 'snapshot' values that represent 15 minute increments.

Here's the data.frame creation code:

test <- data.frame(ip=c("192.168.1.2","192.168.1.2","192.168.1.4","192.168.1.4","192.168.1.4","192.168.1.7"), hour=c(2017070700,2017070700,2017070700,2017070701,2017070702,2017070700), snap=c(0,15,0,45,30,15))

There are hundreds and sometimes thousands of timestamps per ip. The code below uses dplyr to

  • a) pad the 0's with a leading 0,
  • b) concat the two integer 'date' fields into one field,
  • c) convert merged integer 'date' field to a POSIX date,
  • d) group by ip,
  • e) mutate a new column that's lagged of the old timestamp by 1 and, if the value is NA, refer back to the original value (THIS IS THE BIT THAT DOESN"T WORK), and
  • f) mutate a new column that takes the difference of the current time and the previous time (by ip).

These steps refer to the comments at the end of each line.

timedelta <- test %>% 
  mutate(snap = formatC(snap, width=2, flag=0)) %>%                      # a) 
  mutate(fulldateint = paste(hour, snap, sep="")) %>%                    # b) 
  mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>% # c) 
  group_by(ip) %>%                                                       # d) 
  mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%           # e) 
  mutate(diff = fulldate-shifted)                                        # f) 

After mutation, the data should look like this:

           ip       hour  snap  fulldateint            fulldate             shifted      diff
       <fctr>      <dbl> <chr>        <chr>              <dttm>              <dttm>    <time>
1 192.168.1.2 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
2 192.168.1.2 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:00:00  900 secs
3 192.168.1.4 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
4 192.168.1.4 2017070701    45 201707070145 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
5 192.168.1.4 2017070702    30 201707070230 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
6 192.168.1.7 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:15:00    0 secs

And if I could get lag to default to its original value, the 'delta-T' would always be 0 when it doesn't have a previous value (which is the desired result).

However, dplyr::lag(fulldate, default=fulldate) throws the error

Error in mutate_impl(.data, dots) : 
Column `shifted` must be length 2 (the group size) or one, not 3

It does work if I use fulldate1, but then I lose the group_by(ip) result, which is necessary. Is it possible to make lag reference its own input within dplyr?

Note: I really would prefer an answer using dplyr and not data.table, if possible, since I've been using dplyr as our primary data munging library, but also since I'd like to suggest to Mr. Wickham that he take this under consideration if it truly has no solution in the existing dplyr library.

回答1:

In the OP's code ...

...
d) group_by(ip) %>%
e) mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%
...

The default= argument should have a length of one. Replacing the OP's code with default = first(fulldate) should work in this case (since the first element won't have a lag and so is where we need to apply the default value).

Related cases:

  • Similarly, with a "lead", we'd want dplyr::lead(x, default=last(x)).
  • With a lag or lead of more than one step (n greater than 1), default= cannot do it and we'd probably need to switch to if_else or case_when or similar. (I'm not sure about the current tidyverse idiom.)


回答2:

I think Frank's solution works pretty well. Here is the complete example:


library(dplyr, warn.conflicts = F)

test <- data.frame(ip=c("192.168.1.2","192.168.1.2","192.168.1.4","192.168.1.4","192.168.1.4","192.168.1.7"),
                   hour=c(2017070700,2017070700,2017070700,2017070701,2017070702,2017070700),
                   snap=c(0,15,0,45,30,15))


test %>%
  mutate(snap = formatC(snap, width = 2, flag = 0)) %>%
  mutate(fulldateint = paste(hour, snap, sep = "")) %>%
  mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>%
  group_by(ip) %>%
  mutate(shifted = lag(fulldate, default = first(fulldate))) %>%
  mutate(diff = fulldate - shifted) %>% 
  ungroup() %>% 
  select(ip, fulldate, shifted, diff)

#> # A tibble: 6 x 4
#>            ip            fulldate             shifted      diff
#>        <fctr>              <dttm>              <dttm>    <time>
#> 1 192.168.1.2 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
#> 2 192.168.1.2 2017-07-07 00:15:00 2017-07-07 00:00:00  900 secs
#> 3 192.168.1.4 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
#> 4 192.168.1.4 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
#> 5 192.168.1.4 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
#> 6 192.168.1.7 2017-07-07 00:15:00 2017-07-07 00:15:00    0 secs


回答3:

How about

ifelse(is.na(lag(value)), value, lag(value))


标签: r dplyr lag mutate