I have a tsv that looks like this (long-form):
one two value
a b 30
a c 40
a d 20
b c 10
b d 05
c d 30
I'm trying to get this into a dataframe for R (or pandas)
a b c d
a 00 30 40 20
b 30 00 10 05
c 40 10 00 30
d 20 05 30 00
The problem is, in my tsv I only have a, b defined and not b,a. So I get a lot of NAs in my dataframe.
The final goal is to get a distance matrix to use in clustering. Any help would be appreciated.
An igraph
solution where you read in the dataframe, with the value assumed as edge weights. You can then convert this to an adjacency matrix
dat <- read.table(header=T, text=" one two value
a b 30
a c 40
a d 20
b c 10
b d 05
c d 30")
library(igraph)
# Make undirected so that graph matrix will be symmetric
g <- graph.data.frame(dat, directed=FALSE)
# add value as a weight attribute
get.adjacency(g, attr="value", sparse=FALSE)
# a b c d
#a 0 30 40 20
#b 30 0 10 5
#c 40 10 0 30
#d 20 5 30 0
Yet another approach is reshape::cast
df.long = data.frame(one=c('a','a','a','b','b','c'),
two=c('b','c','d','c','d','d'),
value=c(30,40,20,10,05,30) )
# cast will recover the upper/lower-triangles...
df <- as.matrix( cast(df.long, one ~ two, fill=0) )
# b c d
# a 30 40 20
# b 0 10 5
# c 0 0 30
So we construct matrix with full indices, and insert:
df <- matrix(nrow=length(indices), ncol=length(indices),dimnames = list(indices,indices))
diag(df) <- 0
# once we assure that the full upper-triangle is present and in sorted order (as Robert's answer does), then we
df[upper.tri(df)] <- as.matrix( cast(df.long, one ~ two, fill=0) )
df[lower.tri(df)] <- df[upper.tri(df)]
UPDATE: the original sketch included these manual kludges
Then the same approaches to add the missing row 'd' and column 'a', and fill the lower triangle by adding the transpose t(df) :
df <- cbind(a=rep(0,4), rbind(df, d=rep(0,3)))
# a b c d
# a 0 30 40 20
# b 0 0 10 5
# c 0 0 0 30
# d 0 0 0 0
df + t(df)
# a b c d
# a 0 30 40 20
# b 30 0 10 5
# c 40 10 0 30
# d 20 5 30 0
Make sure your data is sorted tsv=tsv[with(tsv,order(one,two)),]
, and try this:
n=4
B <- matrix(rep(0,n*n), n)
dimnames(B) <- list(letters[1:n],letters[1:n])
B[lower.tri(B)] <- tsv$value
B[upper.tri(B)]=tsv$value
B
You may try
un1 <- unique(unlist(df1[1:2]))
df1[1:2] <- lapply(df1[1:2], factor, levels=un1)
m1 <- xtabs(value~one+two, df1)
m1+t(m1)
# two
#one a b c d
#a 0 30 40 20
#b 30 0 10 5
#c 40 10 0 30
#d 20 5 30 0
Or you use the row/col
index
m1 <- matrix(0, nrow=length(un1), ncol=length(un1),
dimnames=list(un1, un1))
m1[cbind(match(df1$one, rownames(m1)),
match(df1$two, colnames(m1)))] <- df1$value
m1+t(m1)
# a b c d
#a 0 30 40 20
#b 30 0 10 5
#c 40 10 0 30
#d 20 5 30 0