How to create a rank variable under certain condit

My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.

With this data, I would like to create rank variable as shown third column, fourth column, and so on.

The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC" the rank should be based on all chosenbrand values within 36 hours before the time. (brand1 in second row should not be in the 36 hours)

Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.

If I want to create rank_brand5, rank_brand6 as well...

Is there any simple way?

In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?

Data is as below,

          shoptime          chosenbrand  rank_brand1 rank_brand2 rank_brand3, ...
  2013-09-01 08:35:00 UTC      brand1          NA         NA          NA
  2013-09-01 08:54:00 UTC      brand1          1          NA          NA
  2013-09-01 09:07:00 UTC      brand2          1          2          NA
  2013-09-01 09:08:00 UTC      brand3          1          2          3
  2013-09-01 09:11:00 UTC      brand5          1          2          3
  2013-09-01 09:14:00 UTC      brand2          1          2          3
  2013-09-01 09:26:00 UTC      brand6          1          1          3
  2013-09-01 09:26:00 UTC      brand2          1          1          3
  2013-09-01 09:29:00 UTC      brand2          2          1          3
  2013-09-01 09:32:00 UTC      brand4          2          1          3

Here is code for data

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                           "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                 rank_brand3 = NA,
                  stringsAsFactors = FALSE)

标签： r dataframe dplyr data.table plyr

2条回答

淡お忘

2楼-- · 2019-01-25 21:28

This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast() to reshape from long to wide format, and a second join with the original dat. There can be an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]

               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4

Explanation

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

returns the aggregated results per 36 hours periods:

               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank

Then, this intermediate result is reshaped from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]

              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3

The final right join with the original dat data frame completes the missing rows and columns (see code and result above).

Data

dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
  brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
  stringsAsFactors = FALSE)

0人赞添加讨论(0) 举报

别忘想泡老子

3楼-- · 2019-01-25 21:43

The one possibility is to write a function (with loop) to the job. Considering the data provided in OP:

library(dplyr)

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                  rank_brand3 = NA,
                  stringsAsFactors = FALSE)

#Write a function that data.frame and calculate rank
Calculate.Rank <- function(x){
  #loop through each row and calculate count for each brand 
  for(i in 1:nrow(x)){
    #DateTime of the current row. 
    currentrow.time <- as.POSIXlt(x$shoptime[i])
    #calculate number of times brand1 appears
    x$rank_brand1[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand1" ))
    #calculate number of times brand2 appears
    x$rank_brand2[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand2" ))    
    #calculate number of times brand3 appears
    x$rank_brand3[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand3" ))

#Replace the 0 values with NA. I dont think this right approach as one can consider those count to be 0 anyway

    if(x$rank_brand1[i] == 0 ){
      x$rank_brand1[i] = NA
    }

    if(x$rank_brand2[i] == 0 ){
      x$rank_brand2[i] = NA
    }
    if(x$rank_brand3[i] == 0 ){
      x$rank_brand3[i] = NA
    }    

  }

  #Now count of brand1, brand2 and brand3 is available now. Lets calculate rank.
  new.x <- data.frame(x[,1:2], t(apply(-x[,3:5], 1, rank, ties.method='min', na.last = "keep")))

  print(new.x)
}

Calculate.Rank(dat)

The resulted data.frame new.x will look like:

                shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3
1  2013-09-01 08:35:00 UTC      brand1           1          NA          NA
2  2013-09-01 08:54:00 UTC      brand1           1          NA          NA
3  2013-09-01 09:07:00 UTC      brand2           1           2          NA
4  2013-09-01 09:08:00 UTC      brand3           1           2           2
5  2013-09-01 09:11:00 UTC      brand5           1           2           2
6  2013-09-01 09:14:00 UTC      brand2           1           1           3
7  2013-09-01 09:26:00 UTC      brand6           2           1           3
8  2013-09-01 09:26:00 UTC      brand2           2           1           3
9  2013-09-01 09:29:00 UTC      brand2           2           1           3
10 2013-09-01 09:32:00 UTC      brand4           2           1           3

0人赞添加讨论(0) 举报