I have dataset
mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "52382МСК", class = "factor"), item = c(11709L,
11709L, 11709L, 11709L, 1170L, 1170L, 1170L, 1170L), sales = c(30L,
10L, 20L, 15L, 8L, 10L, 2L, 15L), action = c(0L, 1L, 0L, 0L,
0L, 1L, 0L, 0L)), .Names = c("code", "item", "sales", "action"
), class = "data.frame", row.names = c(NA, -8L))
it has two groups by code and item
code item
52382МСК 11709
52382МСК 1170
Also i have action column. It can have only two values zero(0) or one(1). I need to calculate the median by 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category.
if median is more than the sales, then do not replace it.
This solution good works if i have three preceding zeros category by action column, i.e. which go before one category of action column, and by three zeros by action column that go after the one category.
but if i have 1 preceding zeros category by action column, i.e. which go before one category of action column, and by 2 zeros by action column that go after the one category. it doesn't work correct
replacements <-
data_frame(
action1 = which(mydat$action == 1L),
group = rep(1:length(action1), each = 2, length.out = length(action1)),
sales1 = mydat$sales[action1],
sales_before = mydat$sales[action1 -1L],
sales_after = mydat$sales[action1 +2L]
) %>%
group_by(group) %>%
mutate(
med = median(c(sales_before, sales_after)),
output = pmin(sales1, med)
)
mydat$output <- mydat$sales
mydat$output[replacements$action1] <- replacements$output
I get output
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 10
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
but output should be
code item sales action output
1 52382МСК 11709 30 0 30
2 52382МСК 11709 10 1 10
3 52382МСК 11709 20 0 20
4 52382МСК 11709 15 0 15
5 52382МСК 1170 8 0 8
6 52382МСК 1170 10 1 **8**
7 52382МСК 1170 2 0 2
8 52382МСК 1170 15 0 15
how can i get correct output?
edit
code item sales action
1 a b 2 0
2 a b 4 0
3 a b 3 0
4 a b 10 1
5 a b 4 1
6 a b 10 0
7 a b 6 0
8 a b 6 0
9 c d 2 0
10 c d 4 0
11 c d 3 0
12 c d 10 1
13 c d 10 0
14 c d 6 0
15 c d 6 0
The code has several severe flaws:
- it complete ignores the grouping by
code
and item
- it picks only two values for median calculation instead of the full range of zero action rows while the OP had requested to include 1 row before and 2 rows after each
action == 1
.
If I understand OP's requirements correctly,
- the OP wants to measure the effect of a sales action by calculating the median sales in a period around each sales action (excluding the sales during the action) and comparing it with the actual sales
- separately for each product identified by
code
and item
.
- The length of each sales action can vary (streaks of
action == 1
)
- as well as the number of days before and after each action.
- The expected output is the sales figures on zero action days. On action days, this figure is to be replaced by the median sales of the surrounding zero action days but only if it is less than actual sales figure.
The function below takes three arguments, the dateframe and the number of zero days before and after a sales action. It returns a data.table with the output
column appended as defined by the rules above.
sales_action <- function(DF, zeros_before, zeros_after) {
library(data.table)
library(magrittr)
action_pattern <-
do.call(sprintf,
c(fmt = "%s1+(?=%s)",
stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
))
message("Action pattern used: ", action_pattern)
setDT(DF)[, rn := .I]
tmp <- DF[, paste(action, collapse = "") %>%
stringr::str_locate_all(action_pattern) %>%
as.data.table() %>%
lapply(function(x) rn[x]),
by = .(code, item)][
, end := end + zeros_after]
DF[tmp, on = .(code, item, rn >= start, rn <= end),
med := as.double(median(sales[action == 0])), by = .EACHI][
, output := as.double(sales)][action == 1, output := pmin(sales, med)][
, c("rn", "med") := NULL][]
}
For mydat
as given by the OP we get
sales_action(mydat, 1L, 2L)
Action pattern used: 01+00
code item sales action output
1: 52382MCK 11709 30 0 30
2: 52382MCK 11709 10 1 10
3: 52382MCK 11709 20 0 20
4: 52382MCK 11709 15 0 15
5: 52382MCK 1170 8 0 8
6: 52382MCK 1170 10 1 8
7: 52382MCK 1170 2 0 2
8: 52382MCK 1170 15 0 15
This is in line with OP's expected result.
As a second test case, I have modified the data from OP's edit to include a second action in a one of the groups:
sales_action(mydat2, 1L, 2L)
Action pattern used: 01+00
code item sales action output
1: a b 2 0 2
2: a b 4 0 4
3: a b 3 0 3
4: a b 10 1 3
5: a b 4 1 3
6: a b 2 0 2
7: a b 4 0 4
8: a b 3 0 3
9: a b 10 1 6
10: a b 4 1 4
11: a b 10 0 10
12: a b 6 0 6
13: a b 6 0 6
14: c d 2 0 2
15: c d 4 0 4
16: c d 3 0 3
17: c d 10 1 6
18: c d 10 0 10
19: c d 6 0 6
20: c d 6 0 6
The sample includes two actions for the first product, both with a duration of 2 days and one action of 1 day duration for the second product.
For rows 4, 5 the median of the surrounding zero action rows, i.e, median(c(3, 2, 4))
= 3, was taken.
For rows 9, 10, the median of c(3, 10, 6) is 6 which is less than the actual sales in row 9. So, only row 9 was replaced by the median value.
For row 17 the median of c(3, 10, 6) is 6 which replace the actual sales figure in output
.
If called for 3 zero action days before and after we get
sales_action(mydat2, 3L, 3L)
Action pattern used: 0001+(?=000)
code item sales action output
1: a b 2 0 2
2: a b 4 0 4
3: a b 3 0 3
4: a b 10 1 3
5: a b 4 1 3
6: a b 2 0 2
7: a b 4 0 4
8: a b 3 0 3
9: a b 10 1 5
10: a b 4 1 4
11: a b 10 0 10
12: a b 6 0 6
13: a b 6 0 6
14: c d 2 0 2
15: c d 4 0 4
16: c d 3 0 3
17: c d 10 1 5
18: c d 10 0 10
19: c d 6 0 6
20: c d 6 0 6
Explanation
The key point is to identify which rows belong to the period around each streak of action days. As action
consists only of 0
and 1
we can use pattern matching in character strings with an regular expression.
For this, the action
column is collapsed into a character string (separately for each code
, item
group). Then, stringr::str_locate_all()
is used to find the start and end positions of the action pattern
. action pattern
is a regular expression that is looking for any sequence of 1
s surrounded by the required number of leading and trailing 0
s, resp.
In fact, the regular expression is somewhat more complicated as we have to use lookahead in order to capture overlapping action patterns such as 000111000
in 000111000111000
. The end
position of the lookahead regex points to the last 1
in each sequence instead of the last 0
, so end
will be adjusted later on.
Finally, the start and end positions are converted into row locations in DF
rather than locations relativ to the group and are returned in tmp
.
Now, we do a non-equi join which aggregates and updates DF
with an additional med
column which contains the median sales of the zero action rows which belong to each start
, end
range.
The remaining steps are to prepare the output
column and to remove the helper columns.
Data
mydat2 <-
structure(list(code = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "c", "c", "c", "c", "c", "c", "c"),
item = c("b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "d", "d", "d", "d", "d", "d", "d"), sales = c(2L,
4L, 3L, 10L, 4L, 2L, 4L, 3L, 10L, 4L, 10L, 6L, 6L, 2L, 4L,
3L, 10L, 10L, 6L, 6L), action = c(0L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA,
-20L), class = "data.frame")