I have a big character matrix (15000 x 150), and with the following format:
A B C D
[1,] "0/0" "0/1" "0/0" "1/1"
[2,] "1/1" "1/1" "0/1" "0/1"
[3,] "1/2" "0/3" "1/1" "2/2"
[4,] "0/0" "0/0" "2/2" "0/0"
[5,] "0/0" "0/0" "0/0" "0/0"
I need to do pairwise comparison between columns and get the proportion of rows where
- neither string separated by
'/'
is equal (coded as 0);
- only one string separated by
'/'
is equal (coded as 1);
- both strings separated by
'/'
are equal (coded as 2).
The expected output for the above sample 5 x 4 matrix is
0 1 2
A B 0.2 0.2 0.6
A C 0.2 0.4 0.4
A D 0.2 0.4 0.4
B C 0.4 0.4 0.2
B D 0.2 0.4 0.4
C D 0.6 0.0 0.4
I have tried using pmatch
, however not able to do pairwise comparison to get the above output. any help is appreciated.
Revised question
Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?
This is what I could provide so far:
fun1 <- function (S) {
n <- ncol(S)
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
k <- 1L
for (j in 1:(n - 1)) {
x <- scan(text = S[, j], what = integer(), sep = "/", quiet = TRUE)
for (i in (j + 1):n) {
y <- scan(text = S[, i], what = integer(), sep = "/", quiet = TRUE)
count <- tabulate(.colSums(x == y, 2L, length(x) / 2L) + 1L)
z[k, ] <- count / sum(count)
k <- k + 1L
}
}
z
}
It looks bad as it has a double loop nest written in R, but the innermost kernel is extremely efficient by using scan
, .colSums
and tabulate
. The total number of iterations is choose(ncol(S), 2)
, not too many for your 150-column matrix. I can replace fun1
by an Rcpp version if you want.
## your data
S <- structure(c("0/0", "1/1", "1/2", "0/0", "0/0", "0/1", "1/1",
"0/3", "0/0", "0/0", "0/0", "0/1", "1/1", "2/2", "0/0", "1/1",
"0/1", "2/2", "0/0", "0/0"), .Dim = c(5L, 4L), .Dimnames = list(
NULL, c("A", "B", "C", "D")))
fun1(S)
# 0 1 2
#A&B 0.2 0.2 0.6
#A&C 0.2 0.4 0.4
#A&D 0.2 0.4 0.4
#B&C 0.4 0.4 0.2
#B&D 0.2 0.4 0.4
#C&D 0.6 0.0 0.4
Performance
Ha, when I actually test my function on a 15000 x 150 matrix I found that:
- I could move
scan
out of the loop nest for speedup, that is, I could scan the character matrix into an integer matrix in one go;
scan(text = blabla)
takes forever, while scan(file = blabla)
is fast, so it could be worth reading data from a text file;
- working with a text file is sensitive to the format of the file so writing a robust code is tricky.
I produced a version fun2
with file access, and a version fun3
using Rcpp for the loop nest. It turns out that:
- reading from file is indeed better (but we have to provide the file in "/" separated format);
- Rcpp implemenation of the loop is rewarding.
I came back and posted them here (see revision 2), and I saw user20650's starting with strsplit
. I excluded strsplit
from my option when I started, because I think operation with string can be slow. Yes, it is slow, but still faster than scan
. So I wrote a fun4
using strsplit
and a corresponding fun5
with Rcpp (see revision 3). Profiling says that 60% of the execution time is spent in strsplit
so it is indeed a performance killer. Then I replaced strsplit
, unlist
, as.integer
and matrix
with a single, simpler C++ implementation. It yields a 10x boost!! Well, this is reasonable if you think about it carefully. By using atoi
(or strtol
) from C library <stdlib.h>
, we can directly translate strings into integers, so all string operations are eliminated!
Long story short, I only provide the final, fastest version.
library(Rcpp)
cppFunction("IntegerMatrix getInt (CharacterMatrix Char) {
int m = Char.nrow(), n = Char.ncol();
IntegerMatrix Int(2 * m, n);
char *s1, *s2;
int i, *iptr = &Int(0, 0);
for (i = 0; i < m * n; i++) {
s1 = (char *)Char[i]; s2 = s1;
while(*s2 != '/') s2++; *iptr++ = atoi(s1);
s2++; *iptr++ = atoi(s2);
}
return Int;
}")
cppFunction('NumericMatrix pairwise(NumericMatrix z, IntegerMatrix Int) {
int m = Int.nrow() / 2, n = Int.ncol();
int i, j, k, *x, *y, count[3], *end; bool b1 = 0, b2 = 0;
double M = 1 / (double)m;
for (k = 0, j = 0; j < (n - 1); j++) {
end = &Int(2 * m, j);
for (i = j + 1; i < n; i++, k++) {
x = &Int(0, j); y = &Int(0, i);
count[0] = 0; count[1] = 0; count[2] = 0;
for (; x < end; x += 2, y += 2) {
b1 = (x[0] == y[0]);
b2 = (x[1] == y[1]);
count[(int)b1 + (int)b2]++;
}
z(k, 0) = (double)count[0] * M;
z(k, 1) = (double)count[1] * M;
z(k, 2) = (double)count[2] * M;
}
}
return z;
}')
fun7 <- function (S) {
## separate rows using Rcpp; `Int` is an integer matrix
n <- ncol(S)
Int <- getInt(S)
m <- nrow(Int) / 2
## initialize the resulting matrix `z`
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
## use Rcpp for pairwise summary
pairwise(z, Int)
}
Let's generate a random 15000 x 150 matrix and try it.
sim <- function (m, n) {
matrix(sample(c("0/0", "0/1", "1/0", "1/1"), m * n, TRUE), m, n,
dimnames = list(NULL, 1:n))
}
S <- sim(15000, 150)
system.time(oo <- fun7(S))
# user system elapsed
# 1.324 0.000 1.325
Oh this is lightening fast!
Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?
This kind of adaptation is straightforward at C / C++ level. Just an addition if
test.
## a new C++ function `pairwise_exclude00`
cppFunction('NumericMatrix pairwise_exclude00(NumericMatrix z, IntegerMatrix Int) {
int m = Int.nrow() / 2, n = Int.ncol();
int i, j, k, *x, *y, count[3], size, *end;
bool b1 = 0, b2 = 0, exclude = 0;
double M;
for (k = 0, j = 0; j < (n - 1); j++) {
end = &Int(2 * m, j);
for (i = j + 1; i < n; i++, k++) {
x = &Int(0, j); y = &Int(0, i);
count[0] = 0; count[1] = 0; count[2] = 0; size = 0;
for (; x < end; x += 2, y += 2) {
b1 = (x[0] == y[0]);
b2 = (x[1] == y[1]);
exclude = (x[0] == 0) & (x[1] == 0) & b1 & b2;
if (!exclude) {
count[(int)b1 + (int)b2]++;
size++;
}
}
M = 1 / (double)size;
z(k, 0) = (double)count[0] * M;
z(k, 1) = (double)count[1] * M;
z(k, 2) = (double)count[2] * M;
}
}
return z;
}')
## re-define `fun7` with a new logical argument `exclude00`
fun7 <- function (S, exclude00) {
## separate rows using Rcpp; `Int` is an integer matrix
n <- ncol(S)
Int <- getInt(S)
m <- nrow(Int) / 2
## initialize the resulting matrix `z`
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
## use Rcpp for pairwise summary
if (exclude00) pairwise_exclude00(z, Int)
else pairwise(z, Int)
}
Using the example S
in your question:
fun7(S, TRUE)
# 0 1 2
#A&B 0.3333333 0.3333333 0.3333333
#A&C 0.3333333 0.6666667 0.0000000
#A&D 0.3333333 0.6666667 0.0000000
#B&C 0.5000000 0.5000000 0.0000000
#B&D 0.3333333 0.6666667 0.0000000
#C&D 0.7500000 0.0000000 0.2500000
This uses ideas from 李哲源;s answer, particularly the tabulate
-- gives a wee speed up. For data 15000x160 takes ~14 seconds on my old laptop
# split strings and form matrix for each column
ap = matrix(unlist(strsplit(m, "/")), nc=2, byrow=TRUE)
ap = split.data.frame(ap, rep(colnames(m), each=nrow(m))) # maybe a way to use array?
# get 2-way combination of column names
co = combn(colnames(m), 2)
# test equality of each matrix
ap = apply(co, 2, function(x) tabulate(rowSums(ap[[x[1]]]==ap[[x[2]]])+1, 3))
# output
data.frame(t(co), t(ap)/nrow(m))
data
m = as.matrix(read.table(header=T, text=' A B C D
"0/0" "0/1" "0/0" "1/1"
"1/1" "1/1" "0/1" "0/1"
"1/2" "0/3" "1/1" "2/2"
"0/0" "0/0" "2/2" "0/0"
"0/0" "0/0" "0/0" "0/0"'))
m = do.call(cbind, replicate(40 , m, simplify = FALSE))
m = do.call(rbind, replicate(3000, m, simplify = FALSE))
colnames(m) = paste0("A", 1:160)
You can create 3 functions to indicate 0,1,2 conditions and then iterate over column names to have distinct pairs and apply functions to create resulting data.frame:
library(tidyr)
matrix <- read.csv("matrix.csv", stringsAsFactors = F)
n <-nrow(matrix)
c <- ncol(matrix)
zero <- function(A, B){ res <- sum(!grepl("0", A) & !grepl("0", B))/n }
one <- function(A, B) {
A <- unlist(str_split(A, "/"))
B <- unlist(str_split(B, "/"))
comp <-data.frame(cbind(A==B, c(1,2), id= sort(rep(1:n,2))))%>%spread(V2, V1)
res <- sum(sum(comp[,2]+comp[,3])>0)/n}
two <- function(A, B){res <- sum(A==B)/n}
res <-data.frame()
k <-1
for (i in 1:(c-1)){
for (j in (i+1):c){
A<-matrix[,i]
B<-matrix[,j]
res[k,1] <- colnames(matrix)[i]
res[k,2] <- colnames(matrix)[j]
res[k,3] <- zero(A,B)
res[k,4] <- one(A,B)
res[k,5] <- two(A,B)
k <-k+1
}
}
colnames(res) <-c("G1", "G2", "0", "1", "2")