I am trying to compare each row of a matrix with every other row and count the number of positions in which the two rows differ. The counts are stored in the bottom-left triangle of a result matrix.
For example, when row m[1,] is compared with rows m[2,] and m[3,], the difference counts are stored at mat[2:3, 1] in the result matrix.
My problem is that the input matrix can have up to 1e+07 rows, and the current implementation will not scale in either speed or memory because it makes on the order of n^2 comparisons. Suggestions and help would be appreciated.
diffMatrix <- function(x) {
  rows <- nrow(x)  # number of rows
  cols <- ncol(x)  # number of columns
  if (rows <= 1) stop("'x' must have at least two rows")
  # potential failure point: allocates a rows x rows matrix
  mat <- matrix(NA, rows, rows)
  # fill the bottom-left triangle column by column, ignoring the diagonal
  for (row in 1:(rows - 1)) {
    rRange <- (row + 1):rows
    # repeat the current row so it can be compared with all rows below it
    m <- matrix(x[row, ], nrow = rows - row, ncol = cols, byrow = TRUE)
    # drop = FALSE keeps a matrix even when only one row remains
    mat[rRange, row] <- rowSums(m != x[rRange, , drop = FALSE])
  }
  return(mat)
}
m <- matrix(sample(1:12, 12, replace = TRUE), ncol = 2, byrow = TRUE)
m
#     [,1] [,2]
#[1,]    8    1
#[2,]    4    1
#[3,]    8    4
#[4,]    4    5
#[5,]    3    1
#[6,]    2    2
x <- diffMatrix(m)
x
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]   NA   NA   NA   NA   NA   NA
#[2,]    1   NA   NA   NA   NA   NA
#[3,]    1    2   NA   NA   NA   NA
#[4,]    2    1    2   NA   NA   NA
#[5,]    1    1    2    2   NA   NA
#[6,]    2    2    2    2    2   NA
m <- matrix(sample(1:5, 50000, replace = TRUE), ncol = 10, byrow = TRUE)
# system.time(x <- diffMatrix(m))
#   user  system elapsed
#  20.39    0.38   21.43
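One pure-R variation is to loop over the columns rather than the rows, since there are usually far fewer columns than rows. The following is only a sketch under that assumption, not a tested solution: diffMatrixByCol is a hypothetical name, and the function still builds a full rows x rows matrix, so it does not address the memory limit.

```r
# Sketch: accumulate mismatches one column at a time instead of one row at a time.
# diffMatrixByCol is a hypothetical name used only for illustration.
diffMatrixByCol <- function(x) {
  n <- nrow(x)
  if (n < 2) stop("'x' must have at least two rows")
  counts <- matrix(0L, n, n)
  for (j in seq_len(ncol(x))) {
    # outer() gives an n x n logical matrix: TRUE where two rows differ in column j
    counts <- counts + outer(x[, j], x[, j], "!=")
  }
  # keep only the strict lower triangle, matching the layout used by diffMatrix()
  counts[upper.tri(counts, diag = TRUE)] <- NA
  counts
}

# The six-row example from above, written out explicitly
m <- matrix(c(8, 1, 4, 1, 8, 4, 4, 5, 3, 1, 2, 2), ncol = 2, byrow = TRUE)
res <- diffMatrixByCol(m)
```

This trades rows - 1 R-level iterations for one iteration per column, at the cost of a temporary n x n logical matrix in each pass. For 1e+07 rows there are still roughly 5e13 row pairs, so any approach that stores every count needs on the order of 200 TB at 4 bytes per count, whatever language the loop is written in.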
Here is an alternative using .Call (it seems to give valid results, but I can't guarantee it):