Find all the combinations of a particular column and find their frequencies

Posted 2019-10-16 12:57

My file looks like this:

Pcol       Mcol
P1      M1,M2,M5,M6,M1,M2,M1,M5
P2      M1,M2,M3,M5,M1,M2,M1,M3
P3      M4,M5,M7,M6,M5,M7,M4,M7

I want to find all the combinations (pairs) of the Mcol elements and count how many rows each combination is present in.

Expected output:

Mcol        freq
M1,M2        2
M1,M5        2
M1,M6        1
M2,M5        2
M2,M6        1
M5,M6        2
M1,M3        1
M2,M3        1
M4,M5        1
M4,M7        1
M4,M6        1
M7,M6        1

I have tried this:

x <- read.csv("file.csv" ,header = TRUE, stringsAsFactors = FALSE)
xx <- do.call(rbind.data.frame, 
              lapply(x$Mcol, function(i){
                n <- sort(unlist(strsplit(i, ",")))
                t(combn(n, 2))
              }))

data.frame(table(paste(xx[, 1], xx[, 2], sep = ",")))

It does not give the expected output.

I also tried this:

library(tidyverse)
df1 %>%
   separate_rows(Mcol) %>%
   group_by(Pcol) %>%
   summarise(Mcol = list(combn(Mcol, 2, FUN= toString, simplify = FALSE))) %>% 
   unnest %>% 
   unnest %>%
   count(Mcol)

But it does not give the number of rows in which each combination is present. I want the frequency of rows in which these combinations are present. That is, if M1,M2 is present in both P1 and P2, its frequency should be 2.

Answer 1:

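One option with tidyverse is to split 'Mcol' with separate_rows, drop repeated elements within each 'Pcol' with distinct (so that a pair is counted at most once per row), group by 'Pcol', get the combn of 'Mcol', and, after unnesting, take the count of the 'Mcol' column.

library(tidyverse)
df1 %>%
   # one row per element of Mcol
   separate_rows(Mcol) %>%
   # keep each element once per Pcol; without this, repeated elements
   # inflate the counts and produce pairs such as "M1, M1"
   distinct() %>%
   group_by(Pcol) %>%
   # all pairs within each Pcol, pasted together as "a, b"
   summarise(Mcol = list(combn(Mcol, 2, FUN = toString, simplify = FALSE))) %>%
   # two levels of nesting: a list of pairs inside the list-column
   unnest(Mcol) %>%
   unnest(Mcol) %>%
   # number of rows (Pcol values) in which each pair occurs
   count(Mcol)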
# A tibble: 14 x 2
#   Mcol       n
#   <chr>  <int>
# 1 M1, M2     2
# 2 M1, M3     1
# 3 M1, M5     2
# 4 M1, M6     1
# 5 M2, M3     1
# 6 M2, M5     2
# 7 M2, M6     1
# 8 M3, M5     1
# 9 M4, M5     1
#10 M4, M6     1
#11 M4, M7     1
#12 M5, M6     2
#13 M5, M7     1
#14 M7, M6     1
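
The base R attempt in the question counts a pair once for every repeated occurrence inside a row, because duplicates are never removed before combn is called. Below is a minimal sketch of one way to adapt it. It is not part of the original answer, and it assumes that the data frame x is built inline (a hypothetical stand-in for file.csv) and that the "M1.M5" in row P1 was meant to be "M1,M5". Each row is sorted before pairing, so a pair is always written in the same order (for example "M6,M7" rather than "M7,M6").

```r
# Hypothetical in-memory copy of the question's data
# (assumption: "M1.M5" in row P1 was a typo for "M1,M5").
x <- data.frame(
  Pcol = c("P1", "P2", "P3"),
  Mcol = c("M1,M2,M5,M6,M1,M2,M1,M5",
           "M1,M2,M3,M5,M1,M2,M1,M3",
           "M4,M5,M7,M6,M5,M7,M4,M7"),
  stringsAsFactors = FALSE
)

# For each row: split on commas, drop duplicates, sort, then build all pairs.
pairs <- unlist(lapply(x$Mcol, function(s) {
  items <- sort(unique(strsplit(s, ",", fixed = TRUE)[[1]]))
  # combn() fails when fewer than two elements are available
  if (length(items) < 2) return(character(0))
  combn(items, 2, FUN = paste, collapse = ",")
}))

# Each pair now appears at most once per row, so the table counts rows.
freq <- as.data.frame(table(Mcol = pairs), responseName = "freq",
                      stringsAsFactors = FALSE)
freq
```

The only change that matters compared with the attempt in the question is the call to unique() before combn(); the rest just tidies the result into the two columns Mcol and freq.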

Source: Find all the combinations of a particular column and find their frequencies