Variable definition with mutate that depends on it

Posted 2019-09-10 03:49

Question:

I have some data in the following format:

   time click       interaction
1   407 FALSE              TRUE
2   408  TRUE              TRUE
3   409 FALSE             FALSE
4   410 FALSE             FALSE
5   411 FALSE             FALSE
6   412 FALSE             FALSE
7   413 FALSE             FALSE
8   414 FALSE             FALSE
9   415 FALSE             FALSE
10  416 FALSE             FALSE
11  417 FALSE             FALSE
12  418 FALSE             FALSE
13  419 FALSE             FALSE
14  420 FALSE             FALSE
15  421 FALSE             FALSE
16  422 FALSE             FALSE
17  423 FALSE             FALSE
18  424 FALSE             FALSE
19  425 FALSE             FALSE
20  426 FALSE             FALSE
21  427 FALSE             FALSE
22  428 FALSE             FALSE
23  429 FALSE             FALSE
24  430 FALSE             FALSE
25  431 FALSE             FALSE
26  432 FALSE             FALSE
27  433 FALSE             FALSE
28  434 FALSE             FALSE
29  435 FALSE              TRUE
30  436 FALSE             FALSE

Each row records one second of a user's activity in an application: click is true when they clicked, and interaction is true when there was any interaction at all (a click, typing, scrolling, etc.). I'd like to compute a new variable that is true for the span after a click during which there is no interaction, up until they start interacting again.

So for this new variable, I want it to be true if there was:

  • A click in the last second and no interaction (click or otherwise) in the current second, OR
  • No interaction after a click in the last second, and there's still no interaction in the current second.

I tried something like this with dplyr:

activity %>% mutate(
    nothing.after.click = (lag(click) == TRUE & interaction == FALSE) |
        (lag(nothing.after.click) == TRUE & interaction == FALSE)
)

but unfortunately it doesn't work (it says "Error: object 'nothing.after.click' not found"). How can I do this? If it isn't possible with dplyr, I would welcome the use of something else.

This is the output I'd like:

   time click       interaction nothing.after.click
1   407 FALSE              TRUE               FALSE
2   408  TRUE              TRUE               FALSE
3   409 FALSE             FALSE                TRUE
4   410 FALSE             FALSE                TRUE
5   411 FALSE             FALSE                TRUE
6   412 FALSE             FALSE                TRUE
7   413 FALSE             FALSE                TRUE
8   414 FALSE             FALSE                TRUE
9   415 FALSE             FALSE                TRUE
10  416 FALSE             FALSE                TRUE
11  417 FALSE             FALSE                TRUE
12  418 FALSE             FALSE                TRUE
13  419 FALSE             FALSE                TRUE
14  420 FALSE             FALSE                TRUE
15  421 FALSE             FALSE                TRUE
16  422 FALSE             FALSE                TRUE
17  423 FALSE             FALSE                TRUE
18  424 FALSE             FALSE                TRUE
19  425 FALSE             FALSE                TRUE
20  426 FALSE             FALSE                TRUE
21  427 FALSE             FALSE                TRUE
22  428 FALSE             FALSE                TRUE
23  429 FALSE             FALSE                TRUE
24  430 FALSE             FALSE                TRUE
25  431 FALSE             FALSE                TRUE
26  432 FALSE             FALSE                TRUE
27  433 FALSE             FALSE                TRUE
28  434 FALSE             FALSE                TRUE
29  435 FALSE              TRUE               FALSE
30  436 FALSE             FALSE               FALSE

Ultimately, the goal is to filter these rows where nothing.after.click is true, so if there's another way to think about this problem I'd welcome that too.
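
For anyone who wants to run the answers below, the sample can be rebuilt roughly like this. It is only a sketch: the name dat is the one the answers use (the question calls the same data activity), and expected is a hypothetical helper vector copied from the desired output above.

```r
# Rebuild the 30-row sample shown in the question.
dat <- data.frame(
  time        = 407:436,
  click       = c(FALSE, TRUE, rep(FALSE, 28)),            # single click at time 408
  interaction = c(TRUE, TRUE, rep(FALSE, 26), TRUE, FALSE) # interaction at 407, 408 and 435
)
activity <- dat  # same data under the name used in the question

# Desired nothing.after.click column, copied from the question's output
expected <- c(FALSE, FALSE, rep(TRUE, 26), FALSE, FALSE)
```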

Answer 1:

You can't reference a variable in its own initial definition. What we can do instead is build it up in multiple passes.
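
To see where the self-reference comes from: each row's value depends on the previous row's value, so the column has to be filled in order. Spelled out as an explicit base R loop, the rule from the question looks roughly like this. This is a sketch, not the method used in this answer, and it assumes the first row has no earlier state (treated as FALSE); the data frame is rebuilt inline so the snippet runs on its own.

```r
# Sample data from the question, under the name `dat` used in this answer
dat <- data.frame(
  time        = 407:436,
  click       = c(FALSE, TRUE, rep(FALSE, 28)),
  interaction = c(TRUE, TRUE, rep(FALSE, 26), TRUE, FALSE)
)

n   <- nrow(dat)
nac <- logical(n)  # all FALSE; row 1 stays FALSE (assumption: no earlier state)
for (i in seq_len(n)[-1]) {
  # TRUE if the previous second had a click, or was already "nothing after click"
  after_click <- dat$click[i - 1] || nac[i - 1]
  nac[i] <- after_click && !dat$interaction[i]
}
dat$nac <- nac
```

The loop reads nac[i - 1] while it is still building nac, which is exactly what a single mutate expression cannot do.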

When I look at your conditions:

nothing.after.click = (lag(click) == TRUE & interaction == FALSE) |
        (lag(nothing.after.click) == TRUE & interaction == FALSE)

I see that interaction == FALSE in both possibilities. So, if interaction is TRUE, then nothing.after.click (from here on out nac) is definitely FALSE. Otherwise, I'm not sure yet so I'll set it to NA. That's my first pass:

dat %>% mutate(nac = ifelse(interaction, FALSE, NA))

That takes care of the interaction == FALSE part. The next pass handles the lag(click) == TRUE part of your or clause: anything that is still NA (and therefore undecided) becomes TRUE if lag(click) is TRUE; otherwise we leave it untouched. (== TRUE is redundant, so I left it out.)

dat %>% mutate(nac = ifelse(interaction, FALSE, NA),
               nac = ifelse(lag(click) & is.na(nac), TRUE, nac))

The last pass is the lag(nac) part: anything that is still undefined is set to the previous defined value. This is a job for zoo::na.locf (locf stands for "last observation carried forward"):

library(dplyr)
library(zoo)
dat %>% mutate(nac = ifelse(interaction, FALSE, NA),              # pass 1: interaction rules it out
               nac = ifelse(lag(click) & is.na(nac), TRUE, nac),  # pass 2: click in the previous second
               nac = na.locf(nac))                                # pass 3: carry the last decision forward

#    time click interaction   nac
# 1   407 FALSE        TRUE FALSE
# 2   408  TRUE        TRUE FALSE
# 3   409 FALSE       FALSE  TRUE
# 4   410 FALSE       FALSE  TRUE
# 5   411 FALSE       FALSE  TRUE
# 6   412 FALSE       FALSE  TRUE
# 7   413 FALSE       FALSE  TRUE
# 8   414 FALSE       FALSE  TRUE
# 9   415 FALSE       FALSE  TRUE
# 10  416 FALSE       FALSE  TRUE
# 11  417 FALSE       FALSE  TRUE
# 12  418 FALSE       FALSE  TRUE
# 13  419 FALSE       FALSE  TRUE
# 14  420 FALSE       FALSE  TRUE
# 15  421 FALSE       FALSE  TRUE
# 16  422 FALSE       FALSE  TRUE
# 17  423 FALSE       FALSE  TRUE
# 18  424 FALSE       FALSE  TRUE
# 19  425 FALSE       FALSE  TRUE
# 20  426 FALSE       FALSE  TRUE
# 21  427 FALSE       FALSE  TRUE
# 22  428 FALSE       FALSE  TRUE
# 23  429 FALSE       FALSE  TRUE
# 24  430 FALSE       FALSE  TRUE
# 25  431 FALSE       FALSE  TRUE
# 26  432 FALSE       FALSE  TRUE
# 27  433 FALSE       FALSE  TRUE
# 28  434 FALSE       FALSE  TRUE
# 29  435 FALSE        TRUE FALSE
# 30  436 FALSE       FALSE FALSE
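
The question's end goal was to keep only the rows where the flag is true. A zoo-free sketch of the same three passes, followed by that filter, might look like this. The helper locf is hypothetical (it is not part of zoo), and prev_click is a base R stand-in for lag(click) with FALSE for the first row. As far as I know, zoo's na.locf drops leading NAs by default (na.rm = TRUE), whereas this helper keeps them so the length never changes.

```r
dat <- data.frame(
  time        = 407:436,
  click       = c(FALSE, TRUE, rep(FALSE, 28)),
  interaction = c(TRUE, TRUE, rep(FALSE, 26), TRUE, FALSE)
)

# Passes 1 and 2: FALSE where there is interaction, TRUE right after a click, NA otherwise
prev_click <- c(FALSE, head(dat$click, -1))
nac <- ifelse(dat$interaction, FALSE, ifelse(prev_click, TRUE, NA))

# Pass 3: hypothetical helper that carries the last non-NA value forward
locf <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))  # position of the latest non-NA, 0 if none yet
  c(NA, x)[idx + 1]                                  # position 0 maps to the prepended NA
}
dat$nac <- locf(nac)

# Keep only the idle seconds after a click; which() also drops any remaining NA
idle <- dat[which(dat$nac), ]
```

On the sample data, idle holds the rows for times 409 through 434.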


Answer 2:

There is already a good answer (+1), but here is an alternative using base R.

dat$nac <- with(dat, unlist(
    lapply(split(interaction, cumsum(interaction & click)),
           function(x) c(FALSE, !cumsum(x[-1]))),  # FALSE on the click row, TRUE until the next interaction
    use.names = FALSE
))

#    time click interaction   nac
# 1   407 FALSE        TRUE FALSE
# 2   408  TRUE        TRUE FALSE
# 3   409 FALSE       FALSE  TRUE
# 4   410 FALSE       FALSE  TRUE
# 5   411 FALSE       FALSE  TRUE
# 6   412 FALSE       FALSE  TRUE
# 7   413 FALSE       FALSE  TRUE
# 8   414 FALSE       FALSE  TRUE
# 9   415 FALSE       FALSE  TRUE
# 10  416 FALSE       FALSE  TRUE
# 11  417 FALSE       FALSE  TRUE
# 12  418 FALSE       FALSE  TRUE
# 13  419 FALSE       FALSE  TRUE
# 14  420 FALSE       FALSE  TRUE
# 15  421 FALSE       FALSE  TRUE
# 16  422 FALSE       FALSE  TRUE
# 17  423 FALSE       FALSE  TRUE
# 18  424 FALSE       FALSE  TRUE
# 19  425 FALSE       FALSE  TRUE
# 20  426 FALSE       FALSE  TRUE
# 21  427 FALSE       FALSE  TRUE
# 22  428 FALSE       FALSE  TRUE
# 23  429 FALSE       FALSE  TRUE
# 24  430 FALSE       FALSE  TRUE
# 25  431 FALSE       FALSE  TRUE
# 26  432 FALSE       FALSE  TRUE
# 27  433 FALSE       FALSE  TRUE
# 28  434 FALSE       FALSE  TRUE
# 29  435 FALSE        TRUE FALSE
# 30  436 FALSE       FALSE FALSE

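The trick here is that a click always counts as an interaction, so interaction & click marks each click row. Split the data at those rows to get one group per click; within each group, cumsum over the rows after the click stays at zero until the first interaction, so negating it flags the idle span.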
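
The intermediate steps of that one-liner can be inspected one at a time. This is only a sketch for illustration: grp, pieces and flag_one are hypothetical names, and the data frame is rebuilt so the snippet runs on its own.

```r
dat <- data.frame(
  time        = 407:436,
  click       = c(FALSE, TRUE, rep(FALSE, 28)),
  interaction = c(TRUE, TRUE, rep(FALSE, 26), TRUE, FALSE)
)

# Group id: 0 before the first click, then increased by one at every click
grp <- cumsum(dat$interaction & dat$click)

# One vector of interaction flags per group
pieces <- split(dat$interaction, grp)
lengths(pieces)  # group "0" has 1 row, group "1" has 29 rows

# Per-group rule: the first row is FALSE, later rows are TRUE until an interaction occurs
flag_one <- function(x) c(FALSE, !cumsum(x[-1]))
flag_one(c(TRUE, FALSE, FALSE, TRUE, FALSE))  # FALSE TRUE TRUE FALSE FALSE

nac <- unlist(lapply(pieces, flag_one), use.names = FALSE)
```

One caveat: in the group before the first click, flag_one would also flag idle rows even though no click has happened yet, for example flag_one(c(FALSE, FALSE)) gives FALSE TRUE. That case does not arise in the sample, where the only row before the click has an interaction.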



Tags: r dplyr