I have the following dataset:
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,0,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))
library(lubridate)
df$order_date = as_date(df$order_date)
user_id complete_order order_date
1 1 2015-01-28
1 0 2015-01-31
1 0 2015-02-08
1 1 2015-02-23
2 1 2015-01-25
2 0 2015-01-28
2 0 2015-02-06
2 1 2015-02-21
I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:
user_id complete_order order_date complete_order_time_diff
<fctr> <fctr> <date> <time>
1 1 2015-01-28 NA days
1 0 2015-01-31 3 days
1 0 2015-02-08 11 days
1 1 2015-02-23 26 days
2 1 2015-01-25 NA days
2 0 2015-01-28 3 days
2 0 2015-02-06 12 days
2 1 2015-02-21 27 days
when I try this solution:
library(dplyr)
df %>%
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order==1]-lag(order_date[complete_order==1))
it returns the error:
Error: incompatible size (3), expecting 4 (the group size) or 1
Any help with this will be great, thank you!
try this
Update
for multiple cancelled orders
logic
c3
is anid
every time there is an order (i.e.complete_order not 0
) to increment by 1.c1
calculates the day difference buuser_id
(but for non complete orders the result is wrong)c2
fixes this inconsistency ofc1
with respect to non complete orders.hope this clears things.
I would suggest you work with combinations of
group_by()
andmutate(cumsum())
to better understand the results of having more than one grouped variable.I think you can add a
filter
function in place of the subsetting withorder_date[complete_order == 1]
and make sure theorder_date
(and other variables) are the correct data types by addingstringsAsFactors = F
todata.frame()
):This returns the time until the next complete order (and
NA
if there is not one):It seems that you're looking for the distance of each order from the last completed one. Having a binary vector,
x
,c(NA, cummax(x * seq_along(x))[-length(x)])
gives the indices of the last "1" seen before each element. Then, subtracting each element of "order_date" from the "order_date" at that respective index gives the desired output. E.g.On your data, first format
df
for convenience:And, then, either apply the above approach after a
group_by
:, or, perhaps give a try on operations that avoid grouping (assuming ordered "user_id") after accounting for the indices where "user_id" changes: