Couple the data in all possible combinations

2019-09-03 15:05发布

问题:

I have data in column in two columns like this

Id  Value
1   a
2   f
1   c
1   h
2   a

and I'd like couple the data of the 'Value' column in all possible combinations based on the same Id such as

(a,c)
(a,h)
(c,h)
(f,a)

Is there any R or Python or VBA code to get this task?

回答1:

To return a character matrix with these combinations using base R, try

do.call(rbind, t(sapply(split(df, df$Id), function(i) t(combn(i$Value, 2)))))
     [,1] [,2]
[1,] "a"  "c" 
[2,] "a"  "h" 
[3,] "c"  "h" 
[4,] "f"  "a"

Each row is a desired combination.

To break this down a bit, split splits the data.frame by Id into a list of two data.frames. Then sapply is fed this list and the combn function to find the pairwise combinations within these data.frames. The result from each data.frame (which is a matrix) is transposed to fit your desired structure using t. Finally, this list of matrices is fed to do.call which uses rbind to return the final matrix.

Note: There is an assumption that the value column is character (not the pesky factor variable type). This is easily accomplished in the read. family of functions, like read.csv and read.table by adding the as.is=TRUE argument to your read function (or the longer stringsAsFactors=FALSE). If the variable is already a factor, you can wrap the i$Value statement near the end in as.character: as.character(i$Value) and it will run as desired.



回答2:

Using R you could try:

library(purrr)

df %>%
  split(.$Id) %>%
  map(~ t(combn(.$Value, 2)))

Which gives:

#$`1`
#     [,1] [,2]
#[1,] "a"  "c" 
#[2,] "a"  "h" 
#[3,] "c"  "h" 
#
#$`2`
#     [,1] [,2]
#[1,] f    a   
#Levels: a c f h


回答3:

Just another way (possibly slightly faster as it exploits the fact that you're looking for all pairs, and avoids combn and t):

require(data.table)
dt[, .( c1 = rep(Value, (.N:1)-1L), c2 = rep(Value, (1:.N)-1L) ), by=Id]
#    Id c1 c2
# 1:  1  a  c
# 2:  1  a  h
# 3:  1  c  h
# 4:  2  f  a

.N contains the number of observations for each group.


where dt is:

dt = fread('Id  Value
1   a
2   f
1   c
1   h
2   a')