I have data in column in two columns like this
Id Value
1 a
2 f
1 c
1 h
2 a
and I'd like couple the data of the 'Value' column in all possible combinations based on the same Id such as
(a,c)
(a,h)
(c,h)
(f,a)
Is there any R or Python or VBA code to get this task?
To return a character matrix with these combinations using base R, try
do.call(rbind, t(sapply(split(df, df$Id), function(i) t(combn(i$Value, 2)))))
[,1] [,2]
[1,] "a" "c"
[2,] "a" "h"
[3,] "c" "h"
[4,] "f" "a"
Each row is a desired combination.
To break this down a bit, split
splits the data.frame by Id into a list of two data.frames. Then sapply
is fed this list and the combn
function to find the pairwise combinations within these data.frames. The result from each data.frame (which is a matrix) is transposed to fit your desired structure using t
. Finally, this list of matrices is fed to do.call
which uses rbind
to return the final matrix.
Note: There is an assumption that the value column is character (not the pesky factor variable type). This is easily accomplished in the read.
family of functions, like read.csv
and read.table
by adding the as.is=TRUE argument to your read function (or the longer stringsAsFactors=FALSE). If the variable is already a factor, you can wrap the i$Value
statement near the end in as.character
: as.character(i$Value)
and it will run as desired.
Using R
you could try:
library(purrr)
df %>%
split(.$Id) %>%
map(~ t(combn(.$Value, 2)))
Which gives:
#$`1`
# [,1] [,2]
#[1,] "a" "c"
#[2,] "a" "h"
#[3,] "c" "h"
#
#$`2`
# [,1] [,2]
#[1,] f a
#Levels: a c f h
Just another way (possibly slightly faster as it exploits the fact that you're looking for all pairs, and avoids combn
and t
):
require(data.table)
dt[, .( c1 = rep(Value, (.N:1)-1L), c2 = rep(Value, (1:.N)-1L) ), by=Id]
# Id c1 c2
# 1: 1 a c
# 2: 1 a h
# 3: 1 c h
# 4: 2 f a
.N
contains the number of observations for each group.
where dt is:
dt = fread('Id Value
1 a
2 f
1 c
1 h
2 a')