Ordering and filtering(delete) the rows after a gr

2019-09-21 00:15发布

问题:

I have the following dataset:

USERNAME API_TRACK_EVENT         TIME
userA    Viewed pic              1454941960
userA    Order/payment           1454941972
userA    Edit pic                1454941973
userA    Order/Changed Address   1454941976
userB    Viewed pic              1454941983
userB    Order/guestlogin        1454941986
userB    Order/Changed Address   1454941992

I want to do the following on the dataset:

  • Order the dataset w.r.t TIME
  • Delete all the APITRACK_EVENTs after the first occurrence of an "Order" event (Target is to get all the api tracking events of the user, before a first order is placed).

So, how should I go ahead with this? [Open to using dplyr too.]


A (somewhat)related post

回答1:

After we arrange by 'USERNAME', 'TIME' and grouped by 'USERNAME', we the index of the first occurrence of 'Order' with grepl and which.max. Add 1 to it and get the sequence (:) from it to the nrow (n()). As we need to remove these rows from the dataset, we can use setdiff to find the row index that are not in the created index and slice it.

library(dplyr)
df1 %>% 
  arrange(USERNAME, TIME) %>%
  group_by(USERNAME) %>%
  slice(setdiff(row_number(), (which.max(grepl("Order", 
                                API_TRACK_EVENT))+1): n()))    
#   USERNAME  API_TRACK_EVENT       TIME    
#      <chr>            <chr>      <int>
#1    userA       Viewed pic 1454941960
#2    userA    Order/payment 1454941972
#3    userB       Viewed pic 1454941983
#4    userB Order/guestlogin 1454941986

Another option is with filter

df1 %>%
   arrange(USERNAME, TIME) %>% 
   group_by(USERNAME) %>% 
   filter(!lag(cumsum(grepl("Order", API_TRACK_EVENT)), default = 0))
#   USERNAME  API_TRACK_EVENT       TIME
#     <chr>            <chr>      <int>
#1    userA       Viewed pic 1454941960
#2    userA    Order/payment 1454941972
#3    userB       Viewed pic 1454941983
#4    userB Order/guestlogin 1454941986


标签: r dplyr