selecting groups by its scenario in R

2019-08-15 02:50发布

问题:

here the data

mydat=structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "52382МСК", class = "factor"), 
    item = c(11709L, 11709L, 11709L, 11709L, 11708L, 11708L, 
    11708L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
    11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
    11710L, 11710L, 11710L, 11710L, 11710L, 11710L), sales = c(30L, 
    10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 
    3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 
    2L, 10L, 3L), action = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 
    0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
    0L, 1L, 0L, 0L, 0L)), .Names = c("code", "item", "sales", 
"action"), class = "data.frame", row.names = c(NA, -28L))

I have 3 groups vars code+item. Here 3 groups:

code    item
52382МСК    11709
52382МСК    11708
52382МСК    11710

Also i have action column. It can have only two values zero(0) or one(1).

each group represents 3 scenarios

52382МСК    11709

it is scenario when we have 1 zero category of action col. before first category of action col , and two zeros after first category of action col. Note: maybe scenario when we have 2 zero category of action col. before first category of action col , and 1 zero after first category of action col.

52382МСК    11708

it is scenario when we have 1 zero category of action col. and 1 zeros after first category of action col.

52382МСК    11710

it is scenario when we have 3(or more) zero category of action col. and 3(or more) zeros after first category of action col.

How can i select groups which have each of scenario? I.e Mydat1 it is groups with first scenario, Mydat2 it is groups with second scenario, and Mydat3 it is groups with third scenario

Output is simple

mydat1

code    item    sales   action
52382МСК    11709   30  0
52382МСК    11709   10  1
52382МСК    11709   20  0
52382МСК    11709   15  0



mydat2
code    item    sales   action
52382МСК    11708   2   0
52382МСК    11708   10  1
52382МСК    11708   3   0

mydat3
code    item    sales   action
52382МСК    11710   30  0
52382МСК    11710   10  0
52382МСК    11710   20  0
52382МСК    11710   15  1
52382МСК    11710   2   0
52382МСК    11710   10  0
52382МСК    11710   3   0
52382МСК    11710   30  0
52382МСК    11710   10  0
52382МСК    11710   20  0
52382МСК    11710   15  1
52382МСК    11710   2   0
52382МСК    11710   10  0
52382МСК    11710   3   0
52382МСК    11710   30  0
52382МСК    11710   10  0
52382МСК    11710   20  0
52382МСК    11710   15  1
52382МСК    11710   2   0
52382МСК    11710   10  0
52382МСК    11710   3   0

Edit

I forgot, it can be scenario when we have 1 zero category of action col. before first category of action col , and three zeros after first category of action col. or maybe scenario when we have 3 zero category of action col. before first category of action col , and 1 zero after first category of action col.

(mydat4)

Also can be scenario when we have 2 zero category of action col. before first category of action col , and three zeros after first category of action col. or maybe scenario when we have 3 zero category of action col. before first category of action col , and 2 zero after first category of action col.

(mydat5)

I.E. it must work, only with these scenarios.

edit2

i found next group, it has only one row

code    item    sales   action
52499МСК    11202   2   0

how to do that if data has only one row it would be 6 scenario?

also can be so

code    item    sales   action
52499МСК    11202   2   0
 52499МСК   11202   2   1

or

code    item    sales   action
52499МСК    11202   2   0
 52499МСК   11202   2   0

or

code    item    sales   action
52499МСК    11202   2   1
 52499МСК   11202   2   1

if we have only two rows in group, then 7 scenario

回答1:

If I understand you correctly, then the scenarios are as follows:

s1: 0100
s2: 010
s3: 000...1000...
s4: 01000 or 00010
s5: 001000 or 000100
s6: 0 or 1
s7: 01 or 00 or 10 or 11

If so, then conveniently each of these scenarios has a unique count of rows. s1 is 4 rows, s2 is 3, s3 is 7+, s4 is 5, s5 is 6, s6 is 1, and s7 is 2.

Using that fact, we can do the following:

library(dplyr)

mydat = structure(list(code = c("52382MCK", "52382MCK", "52382MCK", "52382MCK", 
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", 
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", 
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", 
"52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK", "52382MCK"
), item = c(11709L, 11709L, 11709L, 11709L, 11708L, 11708L, 11708L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L), sales = c(30L, 10L, 
20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 
10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L), 
    action = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
    0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
    0L)), class = "data.frame", row.names = c(NA, -28L), .Names = c("code", 
"item", "sales", "action"))

mydat = mydat %>%
  group_by(code, item) %>%
  mutate(groups_item_count = n(),
         scenario = case_when(groups_item_count == 4 ~ 1,
                              groups_item_count == 3 ~ 2,
                              groups_item_count >= 7 ~ 3,
                              groups_item_count == 5 ~ 4,
                              groups_item_count == 6 ~ 5,
                              groups_item_count == 1 ~ 6,
                              groups_item_count == 2 ~ 7))

This will add a "scenario" column that will indicate whether it's scenario 1,2,3,4, or 5. I would strongly suggest you don't break up your data frame into multiple data frames. I would be willing to guess that dplyr's filter() and group_by() functions would be more efficient for accomplishing whatever it is you want to accomplish next. However, if you're adamant on breaking up into separate data frames for each scenario, then you could do:

mydat1 = filter(mydat, scenario == 1)
mydat2 = filter(mydat, scenario == 2)
mydat3 = filter(mydat, scenario == 3)
mydat4 = filter(mydat, scenario == 4)
mydat5 = filter(mydat, scenario == 5)
mydat6 = filter(mydat, scenario == 6)
mydat7 = filter(mydat, scenario == 7)


回答2:

If I understand correctly, the OP wants to classify the different groups in his dataset by the pattern of zeros and ones in the action column. In the question he has described several patterns he expects to find in the data (called "scenarios").

So far, the question was edited two times to add more scenarios. This indicates that OP's list of scenarios might not be complete. Therefore, I suggest a different approach which defines a complete set of scenarios (or patterns of 0 and 1) and then tries to find these patterns in the dataset.

Define possible scenarios

IIUC, the different scenarios depend on the number of consecutive 0s before and after each sequence or streak of 1s. (The sample dataset in this question has only single 1s but the OP has posted related questions where the data contain streaks of up to three 1s.)

According to OP's descriptions there can be none 0 or a sequence of up to three consecutive 0s before or after a streak of 1s. In addition, a group may contain only 0s.

With this, a table of all possible patterns / scenarios can be created:

library(data.table)
library(magrittr)

max_zeros <- 3
zeros <- sapply(0:max_zeros, stringr::str_dup, string = "0")
names(zeros) <- as.character(nchar(zeros))
sc <- CJ(zeros.before = zeros, zeros.after = zeros)[
  , scenario.name := paste(nchar(zeros.before), nchar(zeros.after), sep = "-")][
    , action.pattern := sprintf("%s1+(?=%s)", zeros.before, zeros.after)][]
# special case: all zero
sc0 <- data.table(
  zeros.before = NA,
  zeros.after = NA, 
  scenario.name = "no1", 
  action.pattern = "^0+$")
sc <- rbind(sc0, sc)
sc
    zeros.before zeros.after scenario.name action.pattern
 1:         <NA>        <NA>           no1           ^0+$
 2:                                    0-0         1+(?=)
 3:                        0           0-1        1+(?=0)
 4:                       00           0-2       1+(?=00)
 5:                      000           0-3      1+(?=000)
 6:            0                       1-0        01+(?=)
 7:            0           0           1-1       01+(?=0)
 8:            0          00           1-2      01+(?=00)
 9:            0         000           1-3     01+(?=000)
10:           00                       2-0       001+(?=)
11:           00           0           2-1      001+(?=0)
12:           00          00           2-2     001+(?=00)
13:           00         000           2-3    001+(?=000)
14:          000                       3-0      0001+(?=)
15:          000           0           3-1     0001+(?=0)
16:          000          00           3-2    0001+(?=00)
17:          000         000           3-3   0001+(?=000)

action.pattern is a regular expression which is used to find the patterns in the character string which is created by collapsing the action column for each group. It uses lookahead to find overlapping patterns.

The scenario.name is a short description of the pattern. For instance, the scenario name 3-3 in row 17 denotes a pattern which consists of three 0, a streak of one or more 1s, followed by three 0, e.g., 0001000 or 00011000 or 00011111000. Scenario name 0-0 in row 2 denotes another special case where the pattern consists only of 1s.

Find patterns in the dataset by group

Now, we can try to find the patterns in the given data. To search for different patterns we

  • collapse the action column into a character string, separately for each code, item group,
  • count the occurrences of each action.pattern in the string, and
  • format the result in wide format (one row for each group).

Please, note that mydat has been modified to include the additional use cases from OP's edits (see Data section below).

mydat[, paste(action, collapse = "") %>% 
        stringr::str_count(sc$action.pattern) %>%
        t() %>% 
        as.data.table() %>% 
        setnames(sc$scenario.name),
      by = .(code, item)]

       code  item no1 0-0 0-1 0-2 0-3 1-0 1-1 1-2 1-3 2-0 2-1 2-2 2-3 3-0 3-1 3-2 3-3
1: 52382MCK 11709   0   1   1   1   0   1   1   1   0   0   0   0   0   0   0   0   0
2: 52382MCK 11708   0   1   1   0   0   1   1   0   0   0   0   0   0   0   0   0   0
3: 52382MCK 11710   0   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
4: 52499MCK 11202   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
5: 52499MCK 11203   0   1   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
6: 52499MCK 11204   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
7: 52499MCK 11205   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

This table shows which possible patterns / scenarios are "hidden" in each group and how many occurrences were found.

For instance, row 2 shows that in the respective group the patterns 0-0 ("1"), 0-1 ("10"), 1-0 ("01"), and 1-1 ("010) have been found one times each. Of all found patterns, 1-1 ("010) is the longest, so we will take this as classification for this group.

Row 3 shows that nearly all patterns have been found three times in the respective group. However, pattern 3-3 ("0001000") is the longest and will be taken as classification for this group.

Automatic classification of groups

We can pick the rightmost pattern as classification of a group automatically:

class <- mydat[, .(scenario.name = sc$scenario.name[
  paste(action, collapse = "") %>% 
    stringr::str_count(sc$action.pattern) %>%
    is_greater_than(0) %>% 
    which() %>% 
    max()
  ]),
  by = .(code, item)][]

class
       code  item scenario.name
1: 52382MCK 11709           1-2
2: 52382MCK 11708           1-1
3: 52382MCK 11710           3-3
4: 52499MCK 11202           no1
5: 52499MCK 11203           1-0
6: 52499MCK 11204           no1
7: 52499MCK 11205           0-0

The classification can also by joined to every row of mydat:

mydat[class, on = .(code, item)]
        code  item sales action scenario.name
 1: 52382MCK 11709    30      0           1-2
 2: 52382MCK 11709    10      1           1-2
 3: 52382MCK 11709    20      0           1-2
 4: 52382MCK 11709    15      0           1-2
 5: 52382MCK 11708     2      0           1-1
 6: 52382MCK 11708    10      1           1-1
 7: 52382MCK 11708     3      0           1-1
 8: 52382MCK 11710    30      0           3-3
 9: 52382MCK 11710    10      0           3-3
10: 52382MCK 11710    20      0           3-3
11: 52382MCK 11710    15      1           3-3
12: 52382MCK 11710     2      0           3-3
13: 52382MCK 11710    10      0           3-3
14: 52382MCK 11710     3      0           3-3
15: 52382MCK 11710    30      0           3-3
16: 52382MCK 11710    10      0           3-3
17: 52382MCK 11710    20      0           3-3
18: 52382MCK 11710    15      1           3-3
19: 52382MCK 11710     2      0           3-3
20: 52382MCK 11710    10      0           3-3
21: 52382MCK 11710     3      0           3-3
22: 52382MCK 11710    30      0           3-3
23: 52382MCK 11710    10      0           3-3
24: 52382MCK 11710    20      0           3-3
25: 52382MCK 11710    15      1           3-3
26: 52382MCK 11710     2      0           3-3
27: 52382MCK 11710    10      0           3-3
28: 52382MCK 11710     3      0           3-3
29: 52499MCK 11202     2      0           no1
30: 52499MCK 11203     2      0           1-0
31: 52499MCK 11203     2      1           1-0
32: 52499MCK 11204     2      0           no1
33: 52499MCK 11204     2      0           no1
34: 52499MCK 11205     2      1           0-0
35: 52499MCK 11205     2      1           0-0
        code  item sales action scenario.name

Caveat

The answer is based on my understanding of OP's requirements and includes a lot of speculation and implicit assumptions on OP's production dataset.

Data

Enhanced version of mydat

mydat <- 
structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("52382MCK", 
"52499MCK"), class = "factor"), item = c(11709L, 11709L, 11709L, 
11709L, 11708L, 11708L, 11708L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11202L, 11203L, 11203L, 11204L, 11204L, 11205L, 11205L
), sales = c(30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 
15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 
20L, 15L, 2L, 10L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), action = c(0L, 
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L)), row.names = c(NA, -35L), class = "data.frame")
# coerce to data.table
setDT(mydat)

EDIT: Splitting by scenarios

The OP has requested to create a separate dataset for each of the 17 scenarios.

In general, I do not recommend to follow this approach unless there are good reasons, additional requirements, or external constraints. Imagine, you were a database administrator. Would you want to create 17 separate tables of identical structure in your database instead of one table where processing can be controlled by selecting subsets?

However, there is the possibility to split() the dataset by scenario:

split(mydat[class, on = .(code, item)], by = "scenario.name")
$`1-2`
       code  item sales action scenario.name
1: 52382MCK 11709    30      0           1-2
2: 52382MCK 11709    10      1           1-2
3: 52382MCK 11709    20      0           1-2
4: 52382MCK 11709    15      0           1-2

$`1-1`
       code  item sales action scenario.name
1: 52382MCK 11708     2      0           1-1
2: 52382MCK 11708    10      1           1-1
3: 52382MCK 11708     3      0           1-1

$`3-3`
        code  item sales action scenario.name
 1: 52382MCK 11710    30      0           3-3
 2: 52382MCK 11710    10      0           3-3
 3: 52382MCK 11710    20      0           3-3
 4: 52382MCK 11710    15      1           3-3
 5: 52382MCK 11710     2      0           3-3
 6: 52382MCK 11710    10      0           3-3
 7: 52382MCK 11710     3      0           3-3
 8: 52382MCK 11710    30      0           3-3
 9: 52382MCK 11710    10      0           3-3
10: 52382MCK 11710    20      0           3-3
11: 52382MCK 11710    15      1           3-3
12: 52382MCK 11710     2      0           3-3
13: 52382MCK 11710    10      0           3-3
14: 52382MCK 11710     3      0           3-3
15: 52382MCK 11710    30      0           3-3
16: 52382MCK 11710    10      0           3-3
17: 52382MCK 11710    20      0           3-3
18: 52382MCK 11710    15      1           3-3
19: 52382MCK 11710     2      0           3-3
20: 52382MCK 11710    10      0           3-3
21: 52382MCK 11710     3      0           3-3
        code  item sales action scenario.name

$no1
       code  item sales action scenario.name
1: 52499MCK 11202     2      0           no1
2: 52499MCK 11204     2      0           no1
3: 52499MCK 11204     2      0           no1

$`1-0`
       code  item sales action scenario.name
1: 52499MCK 11203     2      0           1-0
2: 52499MCK 11203     2      1           1-0

$`0-0`
       code  item sales action scenario.name
1: 52499MCK 11205     2      1           0-0
2: 52499MCK 11205     2      1           0-0

The result is a list a data.tables.

Single elements can be picked from the list by

split(mydat[class, on = .(code, item)], by = "scenario.name")[["1-2"]]
       code  item sales action scenario.name
1: 52382MCK 11709    30      0           1-2
2: 52382MCK 11709    10      1           1-2
3: 52382MCK 11709    20      0           1-2
4: 52382MCK 11709    15      0           1-2

If it is still required for some reason to clutter the workspace with many single data objects this can be done as follows (not recommended)

# workspace before
ls()
[1] "class"     "max_zeros" "mydat"     "sc"        "sc0"       "zeros"  
# create separate datasets
mydat[class, on = .(code, item)] %>% 
  split(by = "scenario.name") %>% 
  set_names(names(.) %>% make.names()) %>%   # make syntactically valid names 
  list2env(.GlobalEnv)
<environment: R_GlobalEnv>
#workspace after
ls()
 [1] "class"     "max_zeros" "mydat"     "no1"       "sc"        "sc0"       "X0.0"     
 [8] "X1.0"      "X1.1"      "X1.2"      "X3.3"      "zeros"