(fast) pairwise comparison of matrix columns whose

I have a big character matrix (15000 x 150), and with the following format:

       A     B       C     D
[1,] "0/0" "0/1"   "0/0" "1/1"
[2,] "1/1" "1/1"   "0/1" "0/1"
[3,] "1/2" "0/3"   "1/1" "2/2"
[4,] "0/0" "0/0"   "2/2" "0/0"
[5,] "0/0" "0/0"   "0/0" "0/0"

I need to do pairwise comparison between columns and get the proportion of rows where

neither string separated by '/' is equal (coded as 0);
only one string separated by '/' is equal (coded as 1);
both strings separated by '/' are equal (coded as 2).

The expected output for the above sample 5 x 4 matrix is

          0    1    2
A    B    0.2  0.2  0.6
A    C    0.2  0.4  0.4  
A    D    0.2  0.4  0.4
B    C    0.4  0.4  0.2
B    D    0.2  0.4  0.4  
C    D    0.6  0.0  0.4

I have tried using pmatch, however not able to do pairwise comparison to get the above output. any help is appreciated.

Revised question

Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?

回答1:

This is what I could provide so far:

fun1 <- function (S) {
  n <- ncol(S)
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  k <- 1L
  for (j in 1:(n - 1)) {
    x <- scan(text = S[, j], what = integer(), sep = "/", quiet = TRUE)
    for (i in (j + 1):n) {
      y <- scan(text = S[, i], what = integer(), sep = "/", quiet = TRUE)
      count <- tabulate(.colSums(x == y, 2L, length(x) / 2L) + 1L)
      z[k, ] <- count / sum(count)
      k <- k + 1L
      }
    }
  z
  }

It looks bad as it has a double loop nest written in R, but the innermost kernel is extremely efficient by using scan, .colSums and tabulate. The total number of iterations is choose(ncol(S), 2), not too many for your 150-column matrix. I can replace fun1 by an Rcpp version if you want.

## your data
S <- structure(c("0/0", "1/1", "1/2", "0/0", "0/0", "0/1", "1/1", 
"0/3", "0/0", "0/0", "0/0", "0/1", "1/1", "2/2", "0/0", "1/1", 
"0/1", "2/2", "0/0", "0/0"), .Dim = c(5L, 4L), .Dimnames = list(
NULL, c("A", "B", "C", "D")))

fun1(S)
#      0   1   2
#A&B 0.2 0.2 0.6
#A&C 0.2 0.4 0.4
#A&D 0.2 0.4 0.4
#B&C 0.4 0.4 0.2
#B&D 0.2 0.4 0.4
#C&D 0.6 0.0 0.4

Performance

Ha, when I actually test my function on a 15000 x 150 matrix I found that:

I could move scan out of the loop nest for speedup, that is, I could scan the character matrix into an integer matrix in one go;
scan(text = blabla) takes forever, while scan(file = blabla) is fast, so it could be worth reading data from a text file;
working with a text file is sensitive to the format of the file so writing a robust code is tricky.

I produced a version fun2 with file access, and a version fun3 using Rcpp for the loop nest. It turns out that:

reading from file is indeed better (but we have to provide the file in "/" separated format);
Rcpp implemenation of the loop is rewarding.

I came back and posted them here (see revision 2), and I saw user20650's starting with strsplit. I excluded strsplit from my option when I started, because I think operation with string can be slow. Yes, it is slow, but still faster than scan. So I wrote a fun4 using strsplit and a corresponding fun5 with Rcpp (see revision 3). Profiling says that 60% of the execution time is spent in strsplit so it is indeed a performance killer. Then I replaced strsplit, unlist, as.integer and matrix with a single, simpler C++ implementation. It yields a 10x boost!! Well, this is reasonable if you think about it carefully. By using atoi (or strtol) from C library <stdlib.h>, we can directly translate strings into integers, so all string operations are eliminated!

Long story short, I only provide the final, fastest version.

library(Rcpp)

cppFunction("IntegerMatrix getInt (CharacterMatrix Char) {
  int m = Char.nrow(), n = Char.ncol();
  IntegerMatrix Int(2 * m, n);
  char *s1, *s2;
  int i, *iptr = &Int(0, 0);
  for (i = 0; i < m * n; i++) {
    s1 = (char *)Char[i]; s2 = s1;
    while(*s2 != '/') s2++; *iptr++ = atoi(s1);
    s2++; *iptr++ = atoi(s2);
    }
  return Int;
  }")

cppFunction('NumericMatrix pairwise(NumericMatrix z, IntegerMatrix Int) {
  int m = Int.nrow() / 2, n = Int.ncol();
  int i, j, k, *x, *y, count[3], *end; bool b1 = 0, b2 = 0;
  double M = 1 / (double)m;
  for (k = 0, j = 0; j < (n - 1); j++) {
    end = &Int(2 * m, j);
    for (i = j + 1; i < n; i++, k++) {
      x = &Int(0, j); y = &Int(0, i);
      count[0] = 0; count[1] = 0; count[2] = 0;
      for (; x < end; x += 2, y += 2) {
        b1 = (x[0] == y[0]);
        b2 = (x[1] == y[1]);
        count[(int)b1 + (int)b2]++;
        }
      z(k, 0) = (double)count[0] * M;
      z(k, 1) = (double)count[1] * M;
      z(k, 2) = (double)count[2] * M;
      }
    }
  return z;
  }')

fun7 <- function (S) {
  ## separate rows using Rcpp; `Int` is an integer matrix
  n <- ncol(S)
  Int <- getInt(S)
  m <- nrow(Int) / 2
  ## initialize the resulting matrix `z`
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  ## use Rcpp for pairwise summary
  pairwise(z, Int)
  }

Let's generate a random 15000 x 150 matrix and try it.

sim <- function (m, n) {
  matrix(sample(c("0/0", "0/1", "1/0", "1/1"), m * n, TRUE), m, n,
         dimnames = list(NULL, 1:n))
  }

S <- sim(15000, 150)
system.time(oo <- fun7(S))
#   user  system elapsed 
#  1.324   0.000   1.325

Oh this is lightening fast!

Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?

This kind of adaptation is straightforward at C / C++ level. Just an addition if test.

## a new C++ function `pairwise_exclude00`
cppFunction('NumericMatrix pairwise_exclude00(NumericMatrix z, IntegerMatrix Int) {
  int m = Int.nrow() / 2, n = Int.ncol();
  int i, j, k, *x, *y, count[3], size, *end;
  bool b1 = 0, b2 = 0, exclude = 0;
  double M; 
  for (k = 0, j = 0; j < (n - 1); j++) {
    end = &Int(2 * m, j);
    for (i = j + 1; i < n; i++, k++) {
      x = &Int(0, j); y = &Int(0, i);
      count[0] = 0; count[1] = 0; count[2] = 0; size = 0;
      for (; x < end; x += 2, y += 2) {
        b1 = (x[0] == y[0]);
        b2 = (x[1] == y[1]);
        exclude = (x[0] == 0) & (x[1] == 0) & b1 & b2;
        if (!exclude) {
          count[(int)b1 + (int)b2]++;
          size++;
          }
        }
      M = 1 / (double)size;
      z(k, 0) = (double)count[0] * M;
      z(k, 1) = (double)count[1] * M;
      z(k, 2) = (double)count[2] * M;
      }
    }
  return z;
  }')

## re-define `fun7` with a new logical argument `exclude00`
fun7 <- function (S, exclude00) {
  ## separate rows using Rcpp; `Int` is an integer matrix
  n <- ncol(S)
  Int <- getInt(S)
  m <- nrow(Int) / 2
  ## initialize the resulting matrix `z`
  ref2 <- combn(colnames(S), 2)
  ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
  z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
  ## use Rcpp for pairwise summary
  if (exclude00) pairwise_exclude00(z, Int)
  else pairwise(z, Int)
  }

Using the example S in your question:

fun7(S, TRUE)
#            0         1         2
#A&B 0.3333333 0.3333333 0.3333333
#A&C 0.3333333 0.6666667 0.0000000
#A&D 0.3333333 0.6666667 0.0000000
#B&C 0.5000000 0.5000000 0.0000000
#B&D 0.3333333 0.6666667 0.0000000
#C&D 0.7500000 0.0000000 0.2500000

回答2:

This uses ideas from 李哲源;s answer, particularly the tabulate -- gives a wee speed up. For data 15000x160 takes ~14 seconds on my old laptop

# split strings and form matrix for each column
ap =  matrix(unlist(strsplit(m, "/")), nc=2, byrow=TRUE)
ap = split.data.frame(ap, rep(colnames(m), each=nrow(m))) # maybe a way to use array?

# get 2-way combination of column names
co = combn(colnames(m), 2)

# test equality of each matrix
ap = apply(co, 2, function(x) tabulate(rowSums(ap[[x[1]]]==ap[[x[2]]])+1, 3))

# output
data.frame(t(co), t(ap)/nrow(m))

data

m = as.matrix(read.table(header=T, text='       A     B       C     D
 "0/0" "0/1"   "0/0" "1/1"
 "1/1" "1/1"   "0/1" "0/1"
 "1/2" "0/3"   "1/1" "2/2"
 "0/0" "0/0"   "2/2" "0/0"
 "0/0" "0/0"   "0/0" "0/0"'))

m = do.call(cbind, replicate(40 , m, simplify = FALSE))
m = do.call(rbind, replicate(3000, m, simplify = FALSE))
colnames(m) =  paste0("A", 1:160)

回答3:

You can create 3 functions to indicate 0,1,2 conditions and then iterate over column names to have distinct pairs and apply functions to create resulting data.frame:

library(tidyr)
matrix <- read.csv("matrix.csv", stringsAsFactors = F)
n <-nrow(matrix)
c <- ncol(matrix)
zero <- function(A, B){ res <- sum(!grepl("0", A) & !grepl("0", B))/n }
one <- function(A, B) {
  A <- unlist(str_split(A, "/"))
  B <- unlist(str_split(B, "/"))
  comp <-data.frame(cbind(A==B, c(1,2), id= sort(rep(1:n,2))))%>%spread(V2, V1)
  res <- sum(sum(comp[,2]+comp[,3])>0)/n} 
two <- function(A, B){res <- sum(A==B)/n}  

res <-data.frame()
k <-1
for (i in 1:(c-1)){
  for (j in (i+1):c){
    A<-matrix[,i]
    B<-matrix[,j]
    res[k,1] <- colnames(matrix)[i]
    res[k,2] <- colnames(matrix)[j]
    res[k,3] <- zero(A,B)
    res[k,4] <- one(A,B)
    res[k,5] <- two(A,B)
    k <-k+1
  }
}
colnames(res) <-c("G1", "G2", "0", "1", "2")