incorrect Rscript work when replacing medians

2019-08-26 11:01发布

问题:

I have dataset

mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = "52382МСК", class = "factor"), item = c(11709L, 
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L, 
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))

it has two groups by code and item

code    item
52382МСК    11709
52382МСК    1170

Also i have action column. It can have only two values zero(0) or one(1). I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. if median is more than the sales, then do not replace it.

This solution good works if i have three preceding zeros category by action column, i.e. which go before one category of action column, and by three zeros by action column that go after the one category. but if i have 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. it doesn't work correct

replacements <- 
  data_frame(
    action1      = which(mydat$action == 1L),
    group        = rep(1:length(action1), each = 2, length.out = length(action1)),
    sales1       = mydat$sales[action1],
    sales_before = mydat$sales[action1 -1L],
    sales_after  = mydat$sales[action1 +2L]
  ) %>%
  group_by(group) %>%
  mutate(
    med   = median(c(sales_before, sales_after)),
    output = pmin(sales1, med)
  )

mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output

I get output

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     10
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

but output should be

   code  item sales action output
1 52382МСК 11709    30      0     30
2 52382МСК 11709    10      1     10
3 52382МСК 11709    20      0     20
4 52382МСК 11709    15      0     15
5 52382МСК  1170     8      0      8
6 52382МСК  1170    10      1     **8**
7 52382МСК  1170     2      0      2
8 52382МСК  1170    15      0     15

how can i get correct output?

edit

   code item sales action
1     a    b     2      0
2     a    b     4      0
3     a    b     3      0
4     a    b    10      1
5     a    b     4      1
6     a    b    10      0
7     a    b     6      0
8     a    b     6      0
9     c    d     2      0
10    c    d     4      0
11    c    d     3      0
12    c    d    10      1
13    c    d    10      0
14    c    d     6      0
15    c    d     6      0

回答1:

The code has several severe flaws:

  • it complete ignores the grouping by code and item
  • it picks only two values for median calculation instead of the full range of zero action rows while the OP had requested to include 1 row before and 2 rows after each action == 1.

If I understand OP's requirements correctly,

  • the OP wants to measure the effect of a sales action by calculating the median sales in a period around each sales action (excluding the sales during the action) and comparing it with the actual sales
  • separately for each product identified by code and item.
  • The length of each sales action can vary (streaks of action == 1)
  • as well as the number of days before and after each action.
  • The expected output is the sales figures on zero action days. On action days, this figure is to be replaced by the median sales of the surrounding zero action days but only if it is less than actual sales figure.

The function below takes three arguments, the dateframe and the number of zero days before and after a sales action. It returns a data.table with the output column appended as defined by the rules above.

sales_action <- function(DF, zeros_before, zeros_after) {
  library(data.table)
  library(magrittr)
  action_pattern <- 
    do.call(sprintf, 
            c(fmt = "%s1+(?=%s)", 
              stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
            ))
  message("Action pattern used: ", action_pattern)
  setDT(DF)[, rn := .I]
  tmp <- DF[, paste(action, collapse = "") %>% 
              stringr::str_locate_all(action_pattern) %>% 
              as.data.table() %>% 
              lapply(function(x) rn[x]),
            by = .(code, item)][
              , end := end + zeros_after]
  DF[tmp, on = .(code, item, rn >= start, rn <= end), 
     med := as.double(median(sales[action == 0])), by = .EACHI][
       , output := as.double(sales)][action == 1, output := pmin(sales, med)][
         , c("rn", "med") := NULL][]
}

For mydat as given by the OP we get

sales_action(mydat, 1L, 2L)
Action pattern used: 01+00
       code  item sales action output
1: 52382MCK 11709    30      0     30
2: 52382MCK 11709    10      1     10
3: 52382MCK 11709    20      0     20
4: 52382MCK 11709    15      0     15
5: 52382MCK  1170     8      0      8
6: 52382MCK  1170    10      1      8
7: 52382MCK  1170     2      0      2
8: 52382MCK  1170    15      0     15

This is in line with OP's expected result.

As a second test case, I have modified the data from OP's edit to include a second action in a one of the groups:

sales_action(mydat2, 1L, 2L)
Action pattern used: 01+00
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      6
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      6
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

The sample includes two actions for the first product, both with a duration of 2 days and one action of 1 day duration for the second product.

For rows 4, 5 the median of the surrounding zero action rows, i.e, median(c(3, 2, 4)) = 3, was taken.

For rows 9, 10, the median of c(3, 10, 6) is 6 which is less than the actual sales in row 9. So, only row 9 was replaced by the median value.

For row 17 the median of c(3, 10, 6) is 6 which replace the actual sales figure in output.

If called for 3 zero action days before and after we get

sales_action(mydat2, 3L, 3L)
Action pattern used: 0001+(?=000)
    code item sales action output
 1:    a    b     2      0      2
 2:    a    b     4      0      4
 3:    a    b     3      0      3
 4:    a    b    10      1      3
 5:    a    b     4      1      3
 6:    a    b     2      0      2
 7:    a    b     4      0      4
 8:    a    b     3      0      3
 9:    a    b    10      1      5
10:    a    b     4      1      4
11:    a    b    10      0     10
12:    a    b     6      0      6
13:    a    b     6      0      6
14:    c    d     2      0      2
15:    c    d     4      0      4
16:    c    d     3      0      3
17:    c    d    10      1      5
18:    c    d    10      0     10
19:    c    d     6      0      6
20:    c    d     6      0      6

Explanation

The key point is to identify which rows belong to the period around each streak of action days. As action consists only of 0 and 1 we can use pattern matching in character strings with an regular expression.

For this, the action column is collapsed into a character string (separately for each code, item group). Then, stringr::str_locate_all() is used to find the start and end positions of the action pattern. action pattern is a regular expression that is looking for any sequence of 1s surrounded by the required number of leading and trailing 0s, resp.

In fact, the regular expression is somewhat more complicated as we have to use lookahead in order to capture overlapping action patterns such as 000111000 in 000111000111000. The end position of the lookahead regex points to the last 1 in each sequence instead of the last 0, so end will be adjusted later on.

Finally, the start and end positions are converted into row locations in DF rather than locations relativ to the group and are returned in tmp.

Now, we do a non-equi join which aggregates and updates DF with an additional med column which contains the median sales of the zero action rows which belong to each start, end range.

The remaining steps are to prepare the output column and to remove the helper columns.

Data

mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a", 
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"), 
    item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b", 
    "b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L, 
    4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L, 
    3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA, 
-20L), class = "data.frame")