I have the following dataset:
USERNAME API_TRACK_EVENT TIME
userA Viewed pic 1454941960
userA Order/payment 1454941972
userA Edit pic 1454941973
userA Order/Changed Address 1454941976
userB Viewed pic 1454941983
userB Order/guestlogin 1454941986
userB Order/Changed Address 1454941992
I want to do the following on the dataset:
- Order the dataset w.r.t
TIME
- Delete all the
APITRACK_EVENT
s after the first occurrence of an "Order" event (Target is to get all the api tracking events of the user, before a first order is placed).
So, how should I go ahead with this? [Open to using dplyr too.]
A (somewhat)related post
After we arrange
by 'USERNAME', 'TIME' and grouped by 'USERNAME', we the index of the first occurrence of 'Order' with grepl
and which.max
. Add 1 to it and get the sequence (:
) from it to the nrow (n()
). As we need to remove these rows from the dataset, we can use setdiff
to find the row index that are not in the created index and slice
it.
library(dplyr)
df1 %>%
arrange(USERNAME, TIME) %>%
group_by(USERNAME) %>%
slice(setdiff(row_number(), (which.max(grepl("Order",
API_TRACK_EVENT))+1): n()))
# USERNAME API_TRACK_EVENT TIME
# <chr> <chr> <int>
#1 userA Viewed pic 1454941960
#2 userA Order/payment 1454941972
#3 userB Viewed pic 1454941983
#4 userB Order/guestlogin 1454941986
Another option is with filter
df1 %>%
arrange(USERNAME, TIME) %>%
group_by(USERNAME) %>%
filter(!lag(cumsum(grepl("Order", API_TRACK_EVENT)), default = 0))
# USERNAME API_TRACK_EVENT TIME
# <chr> <chr> <int>
#1 userA Viewed pic 1454941960
#2 userA Order/payment 1454941972
#3 userB Viewed pic 1454941983
#4 userB Order/guestlogin 1454941986