Pairs of Observations within Groups

Posted 2019-06-14 14:43

I've got a problem that I know how to solve using SQL, but I'm looking to implement a solution in R with a new data set. I've been trying to figure out things with the reshape2 package, but I haven't had any luck with what I'm trying to accomplish. Here's my problem:

I have a dataset in which I need to look at all pairs of items that occur together within the same group. I've created a toy example below to further explain.

BUNCH    FRUITS
1        apples
1        bananas
1        mangos
2        apples
3        bananas
3        apples
4        bananas
4        apples

What I want is a listing of all possible pairs, along with the number of bunches in which each pair occurs together. My output would ideally look like this:

FRUIT1    FRUIT2     FREQUENCY
APPLES    BANANAS    3
APPLES    MANGOS     1

My end goal is to make something that I'll eventually be able to import into Gephi for a network analysis. For this I need a Source and Target column (aka FRUIT1 and FRUIT2 above).

The original solution in SQL is here if that would help anyone: PROC SQL in SAS - All Pairs of Items

Tags: r dataset
1 Answer
beautiful°
#2 · 2019-06-14 15:20

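The following seems to work: build a logical fruit-by-bunch presence matrix, then, for each pair of fruits, count the bunches in which both are present.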

# Logical presence matrix: TRUE where a fruit occurs in a bunch
tmp = table(DF$FRUITS, DF$BUNCH) != 0
tmp
#             1     2     3     4
#  apples  TRUE  TRUE  TRUE  TRUE
#  bananas TRUE FALSE  TRUE  TRUE
#  mangos  TRUE FALSE FALSE FALSE

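# For every pair of fruits, count the bunches that contain both
do.call(rbind, 
        combn(unique(as.character(DF$FRUITS)), 
              2,
              # a column of tmp[x, ] sums to 2 only when both fruits are present
              function(x) data.frame(fr1 = x[1], 
                                     fr2 = x[2], 
                                     freq = sum(colSums(tmp[x, ]) == 2)), 
              simplify = FALSE))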
#      fr1     fr2 freq
#1  apples bananas    3
#2  apples  mangos    1
#3 bananas  mangos    1
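
As a possible alternative, the same counts can be read off a co-occurrence matrix, which avoids looping over every pair. This is only a sketch, not part of the original answer: the names `inc`, `co`, `idx`, and `pairs` are made up for illustration, and the toy data is redefined so the block runs on its own.

```r
# Toy data from the question (redefined so the block runs on its own)
DF <- data.frame(
  BUNCH  = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L),
  FRUITS = c("apples", "bananas", "mangos", "apples",
             "bananas", "apples", "bananas", "apples"),
  stringsAsFactors = FALSE
)

# Bunch-by-fruit incidence matrix: 1 if the fruit occurs in the bunch, else 0
inc <- (table(DF$BUNCH, DF$FRUITS) > 0) * 1

# t(inc) %*% inc: entry [i, j] is the number of bunches holding both fruit i and fruit j
co <- crossprod(inc)

# Keep each unordered pair once (upper triangle) and drop pairs that never co-occur
idx <- which(upper.tri(co) & co > 0, arr.ind = TRUE)

pairs <- data.frame(
  FRUIT1    = rownames(co)[idx[, 1]],
  FRUIT2    = colnames(co)[idx[, 2]],
  FREQUENCY = co[idx],
  stringsAsFactors = FALSE
)
pairs
```

The diagonal of `co` holds the number of bunches each fruit appears in, which is why only the upper triangle is kept. With many distinct items, a sparse matrix may be a better fit than a dense `table`.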

Where DF is the toy data from the question:

DF = structure(list(BUNCH = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L), FRUITS = structure(c(1L, 
2L, 3L, 1L, 2L, 1L, 2L, 1L), .Label = c("apples", "bananas", 
"mangos"), class = "factor")), .Names = c("BUNCH", "FRUITS"), class = "data.frame", row.names = c(NA, 
-8L))
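
For the Gephi step mentioned in the question, the pair counts could be written out as an edge list. The sketch below assumes Gephi's spreadsheet importer picks up columns named Source, Target, and Weight. The result table is hard-coded from the output above so the block runs alone, and the file name `fruit_edges.csv` is a made-up example.

```r
# Pair counts as printed in the answer's output (hard-coded so the block runs alone)
res <- data.frame(
  fr1  = c("apples", "apples", "bananas"),
  fr2  = c("bananas", "mangos", "mangos"),
  freq = c(3, 1, 1),
  stringsAsFactors = FALSE
)

# Rename to the column names assumed for a Gephi edge table
edges <- data.frame(
  Source = res$fr1,
  Target = res$fr2,
  Weight = res$freq,
  stringsAsFactors = FALSE
)

# Drop pairs that never occur together, so they do not become edges
edges <- edges[edges$Weight > 0, ]

# Hypothetical output path; written to a temp directory here to keep the sketch harmless
out_file <- file.path(tempdir(), "fruit_edges.csv")
write.csv(edges, out_file, row.names = FALSE)
```

Co-occurrence has no direction, so the edges would normally be imported as undirected.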