My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.
With this data, I would like to create rank variable as shown third column, fourth column, and so on.
The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC"
the rank should be based on all chosenbrand
values within 36 hours before the time. (brand1
in second row should not be in the 36 hours)
Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.
If I want to create rank_brand5, rank_brand6 as well...
Is there any simple way?
In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?
Data is as below,
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3, ...
2013-09-01 08:35:00 UTC brand1 NA NA NA
2013-09-01 08:54:00 UTC brand1 1 NA NA
2013-09-01 09:07:00 UTC brand2 1 2 NA
2013-09-01 09:08:00 UTC brand3 1 2 3
2013-09-01 09:11:00 UTC brand5 1 2 3
2013-09-01 09:14:00 UTC brand2 1 2 3
2013-09-01 09:26:00 UTC brand6 1 1 3
2013-09-01 09:26:00 UTC brand2 1 1 3
2013-09-01 09:29:00 UTC brand2 2 1 3
2013-09-01 09:32:00 UTC brand4 2 1 3
Here is code for data
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast()
to reshape from long to wide format, and a second join with the original dat
. There can be an arbitrary number of brands.
library(data.table)
library(lubridate)
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
dat, on = "shoptime"]
shoptime brand1 brand2 brand3 brand5 brand6 brand
1: 2013-09-01 08:35:00 NA NA NA NA NA brand1
2: 2013-09-01 08:54:00 1 NA NA NA NA brand1
3: 2013-09-01 09:07:00 1 NA NA NA NA brand2
4: 2013-09-01 09:08:00 1 2 NA NA NA brand3
5: 2013-09-01 09:11:00 1 2 2 NA NA brand5
6: 2013-09-01 09:14:00 1 2 2 2 NA brand2
7: 2013-09-01 09:26:00 1 1 2 2 NA brand6
8: 2013-09-01 09:26:00 1 1 2 2 NA brand2
9: 2013-09-01 09:29:00 2 1 3 3 3 brand2
10: 2013-09-01 09:32:00 2 1 3 3 3 brand4
Explanation
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]
returns the aggregated results per 36 hours periods:
shoptime shoptime brand N rank
1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1 1
2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2 1
3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2 1
4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1 2
5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2 1
6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1 2
7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1 2
8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2 1
9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1 2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1 2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1 2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2 2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3 1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1 3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1 3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1 3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2 2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4 1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1 3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1 3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1 3
shoptime shoptime brand N rank
Then, this intermediate result is reshaped from long to wide format:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]
shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00 1 NA NA NA NA
2: 2013-09-01 09:07:00 1 NA NA NA NA
3: 2013-09-01 09:08:00 1 2 NA NA NA
4: 2013-09-01 09:11:00 1 2 2 NA NA
5: 2013-09-01 09:14:00 1 2 2 2 NA
6: 2013-09-01 09:26:00 1 1 2 2 NA
7: 2013-09-01 09:29:00 2 1 3 3 3
8: 2013-09-01 09:32:00 2 1 3 3 3
The final right join with the original dat
data frame completes the missing rows and columns (see code and result above).
Data
dat <- data.frame(
shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
The one possibility is to write a function (with loop
) to the job. Considering the data provided in OP:
library(dplyr)
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
#Write a function that data.frame and calculate rank
Calculate.Rank <- function(x){
#loop through each row and calculate count for each brand
for(i in 1:nrow(x)){
#DateTime of the current row.
currentrow.time <- as.POSIXlt(x$shoptime[i])
#calculate number of times brand1 appears
x$rank_brand1[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand1" ))
#calculate number of times brand2 appears
x$rank_brand2[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand2" ))
#calculate number of times brand3 appears
x$rank_brand3[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand3" ))
#Replace the 0 values with NA. I dont think this right approach as one can consider those count to be 0 anyway
if(x$rank_brand1[i] == 0 ){
x$rank_brand1[i] = NA
}
if(x$rank_brand2[i] == 0 ){
x$rank_brand2[i] = NA
}
if(x$rank_brand3[i] == 0 ){
x$rank_brand3[i] = NA
}
}
#Now count of brand1, brand2 and brand3 is available now. Lets calculate rank.
new.x <- data.frame(x[,1:2], t(apply(-x[,3:5], 1, rank, ties.method='min', na.last = "keep")))
print(new.x)
}
Calculate.Rank(dat)
The resulted data.frame new.x
will look like:
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3
1 2013-09-01 08:35:00 UTC brand1 1 NA NA
2 2013-09-01 08:54:00 UTC brand1 1 NA NA
3 2013-09-01 09:07:00 UTC brand2 1 2 NA
4 2013-09-01 09:08:00 UTC brand3 1 2 2
5 2013-09-01 09:11:00 UTC brand5 1 2 2
6 2013-09-01 09:14:00 UTC brand2 1 1 3
7 2013-09-01 09:26:00 UTC brand6 2 1 3
8 2013-09-01 09:26:00 UTC brand2 2 1 3
9 2013-09-01 09:29:00 UTC brand2 2 1 3
10 2013-09-01 09:32:00 UTC brand4 2 1 3