I have a big character matrix (15000 x 150) with the following format:
     A     B     C     D
[1,] "0/0" "0/1" "0/0" "1/1"
[2,] "1/1" "1/1" "0/1" "0/1"
[3,] "1/2" "0/3" "1/1" "2/2"
[4,] "0/0" "0/0" "2/2" "0/0"
[5,] "0/0" "0/0" "0/0" "0/0"
I need to do a pairwise comparison between columns and get the proportion of rows where
- neither string separated by '/' is equal (coded as 0);
- only one string separated by '/' is equal (coded as 1);
- both strings separated by '/' are equal (coded as 2).
The expected output for the above sample 5 x 4 matrix is
      0   1   2
A B 0.2 0.2 0.6
A C 0.2 0.4 0.4
A D 0.2 0.4 0.4
B C 0.4 0.4 0.2
B D 0.2 0.4 0.4
C D 0.6 0.0 0.4
I have tried using `pmatch`, but I have not been able to do the pairwise comparison needed to get the above output. Any help is appreciated.
Revised question
Is it possible to exclude the value "0/0" when comparing a pair of columns? That is, when A and B are compared, exclude the rows where A = B = "0/0" and get the proportions for the remaining rows.
This uses ideas from 李哲源's answer, particularly `tabulate`, which gives a wee speed-up. For 15000 x 160 data it takes ~14 seconds on my old laptop.
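The code that accompanied this answer is not preserved in this copy. What follows is only a minimal sketch of the idea (split once with `strsplit`, then count matches per pair with `tabulate`), not the answer's original code. It assumes alleles are compared position by position (left with left, right with right), which reproduces the expected output in the question, and that the matrix has column names. `pair_props` is a hypothetical name.

```r
# Sample matrix from the question (filled column by column).
S <- matrix(c("0/0", "1/1", "1/2", "0/0", "0/0",   # A
              "0/1", "1/1", "0/3", "0/0", "0/0",   # B
              "0/0", "0/1", "1/1", "2/2", "0/0",   # C
              "1/1", "0/1", "2/2", "0/0", "0/0"),  # D
            nrow = 5, dimnames = list(NULL, c("A", "B", "C", "D")))

# pair_props() is a hypothetical helper, not the answer's original function.
pair_props <- function(S) {
  n <- nrow(S)
  # Split every cell once, outside the pairwise loop. After reshaping,
  # row 1 holds the allele left of "/" and row 2 the allele right of it.
  parts <- matrix(as.integer(unlist(strsplit(as.vector(S), "/", fixed = TRUE))),
                  nrow = 2)
  left  <- matrix(parts[1, ], nrow = n)
  right <- matrix(parts[2, ], nrow = n)

  pairs <- combn(ncol(S), 2)  # all distinct column pairs, one per column
  out <- t(apply(pairs, 2, function(p) {
    # Number of matching positions per row: 0, 1 or 2.
    m <- (left[, p[1]] == left[, p[2]]) + (right[, p[1]] == right[, p[2]])
    # Shift by one so that the codes 0/1/2 fall into bins 1/2/3.
    tabulate(m + 1L, nbins = 3L) / n
  }))
  dimnames(out) <- list(paste(colnames(S)[pairs[1, ]], colnames(S)[pairs[2, ]]),
                        c("0", "1", "2"))
  out
}

pair_props(S)
```

Because the comparison is positional, "1/2" against "2/1" counts as no match here; if allele order should not matter, the counting step would need to change.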
You can create 3 functions to indicate 0,1,2 conditions and then iterate over column names to have distinct pairs and apply functions to create resulting data.frame:
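The original code for this answer is missing here, so this is just one way the description could be realised, under the assumption that the two alleles are compared position by position. The helper names (`n_equal`, `none_equal`, `one_equal`, `both_equal`, `apply_pairs`) are hypothetical.

```r
# Sample matrix from the question (filled column by column).
S <- matrix(c("0/0", "1/1", "1/2", "0/0", "0/0",   # A
              "0/1", "1/1", "0/3", "0/0", "0/0",   # B
              "0/0", "0/1", "1/1", "2/2", "0/0",   # C
              "1/1", "0/1", "2/2", "0/0", "0/0"),  # D
            nrow = 5, dimnames = list(NULL, c("A", "B", "C", "D")))

# Number of positions (0, 1 or 2) at which two genotype vectors agree.
n_equal <- function(x, y) {
  a <- do.call(rbind, strsplit(x, "/", fixed = TRUE))
  b <- do.call(rbind, strsplit(y, "/", fixed = TRUE))
  (a[, 1] == b[, 1]) + (a[, 2] == b[, 2])
}

# One function per condition; each returns a proportion of rows.
none_equal <- function(x, y) mean(n_equal(x, y) == 0)
one_equal  <- function(x, y) mean(n_equal(x, y) == 1)
both_equal <- function(x, y) mean(n_equal(x, y) == 2)

# Distinct pairs of column names, one pair per column of `pairs`.
pairs <- combn(colnames(S), 2)

# Apply one of the three functions to every pair of columns.
apply_pairs <- function(f) {
  mapply(function(i, j) f(S[, i], S[, j]),
         pairs[1, ], pairs[2, ], USE.NAMES = FALSE)
}

res <- data.frame(col1 = pairs[1, ], col2 = pairs[2, ],
                  `0` = apply_pairs(none_equal),
                  `1` = apply_pairs(one_equal),
                  `2` = apply_pairs(both_equal),
                  check.names = FALSE)
res
```

This version splits each column three times per pair, so it favours readability over speed; on a 15000 x 150 matrix it would likely be much slower than splitting once up front.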
This is what I could provide so far:
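The code block that stood here is not preserved in this copy. As a rough sketch only, a double loop whose inner kernel uses `scan`, `.colSums` and `tabulate` could look like the following. `fun1_sketch` is a hypothetical name and is not the original `fun1`; it assumes a matrix with at least two named columns and a positional comparison of the two alleles.

```r
# Sample matrix from the question (filled column by column).
S <- matrix(c("0/0", "1/1", "1/2", "0/0", "0/0",   # A
              "0/1", "1/1", "0/3", "0/0", "0/0",   # B
              "0/0", "0/1", "1/1", "2/2", "0/0",   # C
              "1/1", "0/1", "2/2", "0/0", "0/0"),  # D
            nrow = 5, dimnames = list(NULL, c("A", "B", "C", "D")))

# fun1_sketch() is hypothetical; it is not the answer's original fun1.
fun1_sketch <- function(S) {
  n <- nrow(S)
  p <- ncol(S)
  out <- matrix(0, choose(p, 2), 3L, dimnames = list(NULL, c("0", "1", "2")))
  nm <- character(nrow(out))
  k <- 0L
  for (i in seq_len(p - 1L)) {
    # Parse column i into integers laid out as l1, r1, l2, r2, ...
    u <- scan(text = S[, i], what = integer(), sep = "/", quiet = TRUE)
    for (j in (i + 1L):p) {
      v <- scan(text = S[, j], what = integer(), sep = "/", quiet = TRUE)
      k <- k + 1L
      # View u == v as a 2 x n matrix; each column sum is the number of
      # matching positions (0, 1 or 2) in one row of S.
      m <- .colSums(u == v, 2L, n)
      out[k, ] <- tabulate(m + 1, nbins = 3L) / n
      nm[k] <- paste(colnames(S)[i], colnames(S)[j])
    }
  }
  rownames(out) <- nm
  out
}

fun1_sketch(S)
```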
It looks bad as it has a double loop nest written in R, but the innermost kernel is extremely efficient by using `scan`, `.colSums` and `tabulate`. The total number of iterations is `choose(ncol(S), 2)`, not too many for your 150-column matrix. I can replace `fun1` by an Rcpp version if you want.
Performance
Ha, when I actually tested my function on a 15000 x 150 matrix I found that:
- `scan` can be moved out of the loop nest for speedup, that is, I could scan the character matrix into an integer matrix in one go;
- `scan(text = blabla)` takes forever, while `scan(file = blabla)` is fast, so it could be worth reading data from a text file.
I produced a version `fun2` with file access, and a version `fun3` using Rcpp for the loop nest. It turns out that:
I came back and posted them here (see revision 2), and I saw user20650's answer starting with `strsplit`. I excluded `strsplit` from my options when I started, because I thought operations on strings could be slow. Yes, it is slow, but still faster than `scan`. So I wrote a `fun4` using `strsplit` and a corresponding `fun5` with Rcpp (see revision 3). Profiling says that 60% of the execution time is spent in `strsplit`, so it is indeed a performance killer. Then I replaced `strsplit`, `unlist`, `as.integer` and `matrix` with a single, simpler C++ implementation. It yields a 10x boost! Well, this is reasonable if you think about it carefully. By using `atoi` (or `strtol`) from the C library `<stdlib.h>`, we can directly translate strings into integers, so all string operations are eliminated!
Long story short, I only provide the final, fastest version.
Let's generate a random 15000 x 150 matrix and try it.
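The generating code is not preserved in this copy. One possible way to build such a test matrix is sketched below; the allele codes 0 to 3, the uniform sampling, the seed and the helper name `make_genotypes` are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so that the example is reproducible

# make_genotypes() is a hypothetical helper. It draws the two alleles of
# every cell independently and uniformly from `alleles` (an assumption).
make_genotypes <- function(nr, nc, alleles = 0:3) {
  left  <- sample(alleles, nr * nc, replace = TRUE)
  right <- sample(alleles, nr * nc, replace = TRUE)
  matrix(paste(left, right, sep = "/"), nrow = nr, ncol = nc,
         dimnames = list(NULL, paste0("V", seq_len(nc))))
}

S_big <- make_genotypes(15000, 150)
dim(S_big)
S_big[1:3, 1:4]
```

Any of the pairwise functions can then be timed on `S_big` with `system.time()`.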
Oh this is lightning fast!
This kind of adaptation is straightforward at C / C++ level. Just an additional `if` test.
Using the example `S` in your question:
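The code and output that followed are not preserved in this copy. The sketch below expresses the same extra test in plain R rather than in C++, as a row filter applied before counting, so it illustrates the idea and is not the answer's implementation. `pair_props_excl` and its `skip` argument are hypothetical names, and the comparison is again assumed to be positional.

```r
# Sample matrix from the question (filled column by column).
S <- matrix(c("0/0", "1/1", "1/2", "0/0", "0/0",   # A
              "0/1", "1/1", "0/3", "0/0", "0/0",   # B
              "0/0", "0/1", "1/1", "2/2", "0/0",   # C
              "1/1", "0/1", "2/2", "0/0", "0/0"),  # D
            nrow = 5, dimnames = list(NULL, c("A", "B", "C", "D")))

# pair_props_excl() is a hypothetical helper. Rows where BOTH columns of
# a pair equal `skip` are dropped before the proportions are computed.
pair_props_excl <- function(S, skip = "0/0") {
  n <- nrow(S)
  parts <- matrix(as.integer(unlist(strsplit(as.vector(S), "/", fixed = TRUE))),
                  nrow = 2)
  left  <- matrix(parts[1, ], nrow = n)
  right <- matrix(parts[2, ], nrow = n)

  pairs <- combn(ncol(S), 2)
  out <- t(apply(pairs, 2, function(p) {
    i <- p[1]
    j <- p[2]
    # The additional test: keep a row unless both cells are "0/0".
    keep <- !(S[, i] == skip & S[, j] == skip)
    m <- (left[keep, i] == left[keep, j]) + (right[keep, i] == right[keep, j])
    # Proportions over the kept rows only (NaN if no row is kept).
    tabulate(m + 1L, nbins = 3L) / sum(keep)
  }))
  dimnames(out) <- list(paste(colnames(S)[pairs[1, ]], colnames(S)[pairs[2, ]]),
                        c("0", "1", "2"))
  out
}

pair_props_excl(S)
```

Note that the denominator now differs between pairs, since each pair drops its own set of rows.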