here the data
mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "52382МСК", class = "factor"),
item = c(11709L, 11709L, 11709L, 11709L, 11708L, 11708L,
11708L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L), sales = c(30L,
10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L,
3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L,
2L, 10L, 3L), action = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L)), .Names = c("code", "item", "sales",
"action"), class = "data.frame", row.names = c(NA, -28L))
I have 3 groups vars code+item. Here 3 groups:
code item
52382МСК 11709
52382МСК 11708
52382МСК 11710
Also i have action column. It can have only two values zero(0) or one(1).
each group represents 3 scenarios
52382МСК 11709
it is scenario when we have 1 zero category of action col. before first category of action col , and two zeros after first category of action col.
Note: maybe scenario when we have 2 zero category of action col. before first category of action col , and 1 zero after first category of action col.
52382МСК 11708
it is scenario when we have 1 zero category of action col. and 1 zeros after first category of action col.
52382МСК 11710
it is scenario when we have 3(or more) zero category of action col. and 3(or more) zeros after first category of action col.
How can i select groups which have each of scenario?
I.e Mydat1
it is groups with first scenario,
Mydat2
it is groups with second scenario,
and Mydat3
it is groups with third scenario
Output is simple
mydat1
code item sales action
52382МСК 11709 30 0
52382МСК 11709 10 1
52382МСК 11709 20 0
52382МСК 11709 15 0
mydat2
code item sales action
52382МСК 11708 2 0
52382МСК 11708 10 1
52382МСК 11708 3 0
mydat3
code item sales action
52382МСК 11710 30 0
52382МСК 11710 10 0
52382МСК 11710 20 0
52382МСК 11710 15 1
52382МСК 11710 2 0
52382МСК 11710 10 0
52382МСК 11710 3 0
52382МСК 11710 30 0
52382МСК 11710 10 0
52382МСК 11710 20 0
52382МСК 11710 15 1
52382МСК 11710 2 0
52382МСК 11710 10 0
52382МСК 11710 3 0
52382МСК 11710 30 0
52382МСК 11710 10 0
52382МСК 11710 20 0
52382МСК 11710 15 1
52382МСК 11710 2 0
52382МСК 11710 10 0
52382МСК 11710 3 0
Edit
I forgot, it can be scenario when we have 1 zero category of action col. before first category of action col , and three zeros after first category of action col.
or maybe scenario when we have 3 zero category of action col. before first category of action col , and 1 zero after first category of action col.
(mydat4)
Also can be
scenario when we have 2 zero category of action col. before first category of action col , and three zeros after first category of action col.
or maybe scenario when we have 3 zero category of action col. before first category of action col , and 2 zero after first category of action col.
(mydat5)
I.E. it must work, only with these scenarios.
edit2
i found next group, it has only one row
code item sales action
52499МСК 11202 2 0
how to do that if data has only one row it would be 6 scenario?
also can be so
code item sales action
52499МСК 11202 2 0
52499МСК 11202 2 1
or
code item sales action
52499МСК 11202 2 0
52499МСК 11202 2 0
or
code item sales action
52499МСК 11202 2 1
52499МСК 11202 2 1
if we have only two rows in group, then 7 scenario
If I understand you correctly, then the scenarios are as follows:
s1: 0100
s2: 010
s3: 000...1000...
s4: 01000 or 00010
s5: 001000 or 000100
s6: 0 or 1
s7: 01 or 00 or 10 or 11
If so, then conveniently each of these scenarios has a unique count of rows. s1 is 4 rows, s2 is 3, s3 is 7+, s4 is 5, s5 is 6, s6 is 1, and s7 is 2.
Using that fact, we can do the following:
library(dplyr)
mydat = structure(list(code = c("52382MCK", "52382MCK", "52382MCK", "52382MCK",
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK",
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK",
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK",
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK"
), item = c(11709L, 11709L, 11709L, 11709L, 11708L, 11708L, 11708L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L), sales = c(30L, 10L,
20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L,
10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L),
action = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L)), class = "data.frame", row.names = c(NA, -28L), .Names = c("code",
"item", "sales", "action"))
mydat = mydat %>%
group_by(code, item) %>%
mutate(groups_item_count = n(),
scenario = case_when(groups_item_count == 4 ~ 1,
groups_item_count == 3 ~ 2,
groups_item_count >= 7 ~ 3,
groups_item_count == 5 ~ 4,
groups_item_count == 6 ~ 5,
groups_item_count == 1 ~ 6,
groups_item_count == 2 ~ 7))
This will add a "scenario" column that will indicate whether it's scenario 1,2,3,4, or 5. I would strongly suggest you don't break up your data frame into multiple data frames. I would be willing to guess that dplyr's filter()
and group_by()
functions would be more efficient for accomplishing whatever it is you want to accomplish next. However, if you're adamant on breaking up into separate data frames for each scenario, then you could do:
mydat1 = filter(mydat, scenario == 1)
mydat2 = filter(mydat, scenario == 2)
mydat3 = filter(mydat, scenario == 3)
mydat4 = filter(mydat, scenario == 4)
mydat5 = filter(mydat, scenario == 5)
mydat6 = filter(mydat, scenario == 6)
mydat7 = filter(mydat, scenario == 7)
If I understand correctly, the OP wants to classify the different groups in his dataset by the pattern of zeros and ones in the action
column. In the question he has described several patterns he expects to find in the data (called "scenarios").
So far, the question was edited two times to add more scenarios. This indicates that OP's list of scenarios might not be complete. Therefore, I suggest a different approach which defines a complete set of scenarios (or patterns of 0
and 1
) and then tries to find these patterns in the dataset.
Define possible scenarios
IIUC, the different scenarios depend on the number of consecutive 0
s before and after each sequence or streak of 1
s. (The sample dataset in this question has only single 1
s but the OP has posted related questions where the data contain streaks of up to three 1
s.)
According to OP's descriptions there can be none 0
or a sequence of up to three consecutive 0
s before or after a streak of 1
s. In addition, a group may contain only 0
s.
With this, a table of all possible patterns / scenarios can be created:
library(data.table)
library(magrittr)
max_zeros <- 3
zeros <- sapply(0:max_zeros, stringr::str_dup, string = "0")
names(zeros) <- as.character(nchar(zeros))
sc <- CJ(zeros.before = zeros, zeros.after = zeros)[
, scenario.name := paste(nchar(zeros.before), nchar(zeros.after), sep = "-")][
, action.pattern := sprintf("%s1+(?=%s)", zeros.before, zeros.after)][]
# special case: all zero
sc0 <- data.table(
zeros.before = NA,
zeros.after = NA,
scenario.name = "no1",
action.pattern = "^0+$")
sc <- rbind(sc0, sc)
sc
zeros.before zeros.after scenario.name action.pattern
1: <NA> <NA> no1 ^0+$
2: 0-0 1+(?=)
3: 0 0-1 1+(?=0)
4: 00 0-2 1+(?=00)
5: 000 0-3 1+(?=000)
6: 0 1-0 01+(?=)
7: 0 0 1-1 01+(?=0)
8: 0 00 1-2 01+(?=00)
9: 0 000 1-3 01+(?=000)
10: 00 2-0 001+(?=)
11: 00 0 2-1 001+(?=0)
12: 00 00 2-2 001+(?=00)
13: 00 000 2-3 001+(?=000)
14: 000 3-0 0001+(?=)
15: 000 0 3-1 0001+(?=0)
16: 000 00 3-2 0001+(?=00)
17: 000 000 3-3 0001+(?=000)
action.pattern
is a regular expression which is used to find the patterns in the character string which is created by collapsing the action
column for each group. It uses lookahead
to find overlapping patterns.
The scenario.name
is a short description of the pattern. For instance, the scenario name 3-3
in row 17 denotes a pattern which consists of three 0
, a streak of one or more 1
s, followed by three 0
, e.g., 0001000
or 00011000
or 00011111000
. Scenario name 0-0
in row 2 denotes another special case where the pattern consists only of 1
s.
Find patterns in the dataset by group
Now, we can try to find the patterns in the given data. To search for different patterns we
- collapse the
action
column into a character string, separately for each code
, item
group,
- count the occurrences of each
action.pattern
in the string, and
- format the result in wide format (one row for each group).
Please, note that mydat
has been modified to include the additional use cases from OP's edits (see Data section below).
mydat[, paste(action, collapse = "") %>%
stringr::str_count(sc$action.pattern) %>%
t() %>%
as.data.table() %>%
setnames(sc$scenario.name),
by = .(code, item)]
code item no1 0-0 0-1 0-2 0-3 1-0 1-1 1-2 1-3 2-0 2-1 2-2 2-3 3-0 3-1 3-2 3-3
1: 52382MCK 11709 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0
2: 52382MCK 11708 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
3: 52382MCK 11710 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4: 52499MCK 11202 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5: 52499MCK 11203 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
6: 52499MCK 11204 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7: 52499MCK 11205 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This table shows which possible patterns / scenarios are "hidden" in each group and how many occurrences were found.
For instance, row 2 shows that in the respective group the patterns 0-0
("1"
), 0-1
("10"
), 1-0
("01"
), and 1-1
("010
) have been found one times each. Of all found patterns, 1-1
("010
) is the longest, so we will take this as classification for this group.
Row 3 shows that nearly all patterns have been found three times in the respective group. However, pattern 3-3
("0001000"
) is the longest and will be taken as classification for this group.
Automatic classification of groups
We can pick the rightmost pattern as classification of a group automatically:
class <- mydat[, .(scenario.name = sc$scenario.name[
paste(action, collapse = "") %>%
stringr::str_count(sc$action.pattern) %>%
is_greater_than(0) %>%
which() %>%
max()
]),
by = .(code, item)][]
class
code item scenario.name
1: 52382MCK 11709 1-2
2: 52382MCK 11708 1-1
3: 52382MCK 11710 3-3
4: 52499MCK 11202 no1
5: 52499MCK 11203 1-0
6: 52499MCK 11204 no1
7: 52499MCK 11205 0-0
The classification can also by joined to every row of mydat
:
mydat[class, on = .(code, item)]
code item sales action scenario.name
1: 52382MCK 11709 30 0 1-2
2: 52382MCK 11709 10 1 1-2
3: 52382MCK 11709 20 0 1-2
4: 52382MCK 11709 15 0 1-2
5: 52382MCK 11708 2 0 1-1
6: 52382MCK 11708 10 1 1-1
7: 52382MCK 11708 3 0 1-1
8: 52382MCK 11710 30 0 3-3
9: 52382MCK 11710 10 0 3-3
10: 52382MCK 11710 20 0 3-3
11: 52382MCK 11710 15 1 3-3
12: 52382MCK 11710 2 0 3-3
13: 52382MCK 11710 10 0 3-3
14: 52382MCK 11710 3 0 3-3
15: 52382MCK 11710 30 0 3-3
16: 52382MCK 11710 10 0 3-3
17: 52382MCK 11710 20 0 3-3
18: 52382MCK 11710 15 1 3-3
19: 52382MCK 11710 2 0 3-3
20: 52382MCK 11710 10 0 3-3
21: 52382MCK 11710 3 0 3-3
22: 52382MCK 11710 30 0 3-3
23: 52382MCK 11710 10 0 3-3
24: 52382MCK 11710 20 0 3-3
25: 52382MCK 11710 15 1 3-3
26: 52382MCK 11710 2 0 3-3
27: 52382MCK 11710 10 0 3-3
28: 52382MCK 11710 3 0 3-3
29: 52499MCK 11202 2 0 no1
30: 52499MCK 11203 2 0 1-0
31: 52499MCK 11203 2 1 1-0
32: 52499MCK 11204 2 0 no1
33: 52499MCK 11204 2 0 no1
34: 52499MCK 11205 2 1 0-0
35: 52499MCK 11205 2 1 0-0
code item sales action scenario.name
Caveat
The answer is based on my understanding of OP's requirements and includes a lot of speculation and implicit assumptions on OP's production dataset.
Data
Enhanced version of mydat
mydat <-
structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("52382MCK",
"52499MCK"), class = "factor"), item = c(11709L, 11709L, 11709L,
11709L, 11708L, 11708L, 11708L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L,
11710L, 11202L, 11203L, 11203L, 11204L, 11204L, 11205L, 11205L
), sales = c(30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L,
15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L,
20L, 15L, 2L, 10L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), action = c(0L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
1L, 1L)), row.names = c(NA, -35L), class = "data.frame")
# coerce to data.table
setDT(mydat)
EDIT: Splitting by scenarios
The OP has requested to create a separate dataset for each of the 17 scenarios.
In general, I do not recommend to follow this approach unless there are good reasons, additional requirements, or external constraints. Imagine, you were a database administrator. Would you want to create 17 separate tables of identical structure in your database instead of one table where processing can be controlled by selecting subsets?
However, there is the possibility to split()
the dataset by scenario:
split(mydat[class, on = .(code, item)], by = "scenario.name")
$`1-2`
code item sales action scenario.name
1: 52382MCK 11709 30 0 1-2
2: 52382MCK 11709 10 1 1-2
3: 52382MCK 11709 20 0 1-2
4: 52382MCK 11709 15 0 1-2
$`1-1`
code item sales action scenario.name
1: 52382MCK 11708 2 0 1-1
2: 52382MCK 11708 10 1 1-1
3: 52382MCK 11708 3 0 1-1
$`3-3`
code item sales action scenario.name
1: 52382MCK 11710 30 0 3-3
2: 52382MCK 11710 10 0 3-3
3: 52382MCK 11710 20 0 3-3
4: 52382MCK 11710 15 1 3-3
5: 52382MCK 11710 2 0 3-3
6: 52382MCK 11710 10 0 3-3
7: 52382MCK 11710 3 0 3-3
8: 52382MCK 11710 30 0 3-3
9: 52382MCK 11710 10 0 3-3
10: 52382MCK 11710 20 0 3-3
11: 52382MCK 11710 15 1 3-3
12: 52382MCK 11710 2 0 3-3
13: 52382MCK 11710 10 0 3-3
14: 52382MCK 11710 3 0 3-3
15: 52382MCK 11710 30 0 3-3
16: 52382MCK 11710 10 0 3-3
17: 52382MCK 11710 20 0 3-3
18: 52382MCK 11710 15 1 3-3
19: 52382MCK 11710 2 0 3-3
20: 52382MCK 11710 10 0 3-3
21: 52382MCK 11710 3 0 3-3
code item sales action scenario.name
$no1
code item sales action scenario.name
1: 52499MCK 11202 2 0 no1
2: 52499MCK 11204 2 0 no1
3: 52499MCK 11204 2 0 no1
$`1-0`
code item sales action scenario.name
1: 52499MCK 11203 2 0 1-0
2: 52499MCK 11203 2 1 1-0
$`0-0`
code item sales action scenario.name
1: 52499MCK 11205 2 1 0-0
2: 52499MCK 11205 2 1 0-0
The result is a list a data.tables.
Single elements can be picked from the list by
split(mydat[class, on = .(code, item)], by = "scenario.name")[["1-2"]]
code item sales action scenario.name
1: 52382MCK 11709 30 0 1-2
2: 52382MCK 11709 10 1 1-2
3: 52382MCK 11709 20 0 1-2
4: 52382MCK 11709 15 0 1-2
If it is still required for some reason to clutter the workspace with many single data objects this can be done as follows (not recommended)
# workspace before
ls()
[1] "class" "max_zeros" "mydat" "sc" "sc0" "zeros"
# create separate datasets
mydat[class, on = .(code, item)] %>%
split(by = "scenario.name") %>%
set_names(names(.) %>% make.names()) %>% # make syntactically valid names
list2env(.GlobalEnv)
<environment: R_GlobalEnv>
#workspace after
ls()
[1] "class" "max_zeros" "mydat" "no1" "sc" "sc0" "X0.0"
[8] "X1.0" "X1.1" "X1.2" "X3.3" "zeros"