R Matrix. Set particular elements of sparse matrix

Posted 2019-08-05 20:04

Question:

I have a reasonably large sparse matrix (dgCMatrix or dgTMatrix, though the exact class is not very important here), and I want to set some of its elements to zero.
For example, I have a 3e4 * 3e4 matrix that is upper triangular and quite dense: ~23% of its elements are nonzero. (I actually have much bigger matrices, ~1e5 * 1e5, but they are much sparser.) In triplet dgTMatrix form it takes about 3.1 GB of RAM. Now I want to set to zero all elements that are less than some threshold (say, 1).

  1. The very naive approach (which was also discussed here) is the following:

    threshold <- 1
    m[m < threshold] <- 0
    

    But this solution is far from perfect: it takes 130 sec (on a machine with enough RAM, so there is no swapping) and, more importantly, needs ~25-30 GB of additional RAM.

  2. The second solution I found (and am mostly happy with) is far more efficient: construct a new matrix from scratch:

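    threshold <- 1
    # keep the stored entries that are not below the threshold
    # (>= matches the m < threshold rule used in the naive approach)
    ind <- which(m@x >= threshold)
    m <- sparseMatrix(i = m@i[ind], j = m@j[ind], x = m@x[ind], 
                 dims = m@Dim, dimnames = m@Dimnames, 
                 index1 = FALSE,       # the @i and @j slots are 0-based
                 giveCsparse = FALSE,  # triplet form; newer Matrix versions use repr = "T"
                 check = FALSE)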
    

    It takes only ~6 sec and needs ~5 GB of additional RAM.

The question is: can we do better? In particular, can we do this with less RAM usage? It would be perfect if this could be done in place.

Answer 1:

Like this:

library(Matrix)
m <- Matrix(0+1:28, nrow = 4)
m[-3,c(2,4:5,7)] <- m[ 3, 1:4] <- m[1:3, 6] <- 0
m <- as(m, "dgTMatrix")
m
#4 x 7 sparse Matrix of class "dgTMatrix"
#
#[1,] 1 .  9 .  .  .  .
#[2,] 2 . 10 .  .  .  .
#[3,] . .  . . 19  . 27
#[4,] 4 . 12 .  . 24  .

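threshold <- 5
ind <- m@x <= threshold   # logical mask of the stored entries to drop
# filter all three triplet slots with the same mask so they stay aligned
m@x <- m@x[!ind]
m@i <- m@i[!ind]
m@j <- m@j[!ind]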
m
#4 x 7 sparse Matrix of class "dgTMatrix"
#
#[1,] . .  9 .  .  .  .
#[2,] . . 10 .  .  .  .
#[3,] . .  . . 19  . 27
#[4,] . . 12 .  . 24  .

You only need the RAM for the ind vector. If you want to avoid that, you need a loop (probably in Rcpp for performance).
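
For a dgCMatrix the same idea needs one extra step, because column membership is stored in the compressed pointer slot @p rather than in @j. A minimal sketch, assuming non-negative data as in the question; the helper name drop_below_csc is made up for illustration, and it spends extra RAM on a temporary column-index vector:

```r
library(Matrix)

# Hypothetical helper (not part of the Matrix package): drop stored entries
# below `threshold` from a dgCMatrix by filtering its slots directly.
drop_below_csc <- function(m, threshold) {
  stopifnot(is(m, "dgCMatrix"))
  keep <- m@x >= threshold
  # Recover the 1-based column of every stored entry from the column pointers.
  col <- rep.int(seq_len(ncol(m)), diff(m@p))
  m@i <- m@i[keep]
  m@x <- m@x[keep]
  # Rebuild the pointers from the per-column counts of surviving entries.
  m@p <- c(0L, cumsum(tabulate(col[keep], nbins = ncol(m))))
  m
}

# Toy data in compressed column form.
m_csc <- Matrix(0 + 1:28, nrow = 4, sparse = TRUE)
m_small <- drop_below_csc(m_csc, threshold = 5)
validObject(m_small)  # checks that the slots are still consistent
```

Whether this beats rebuilding the matrix with sparseMatrix() on large inputs would have to be measured.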



Answer 2:

Just came across this question.

The Matrix package includes the drop0() function, used as follows:

threshold <- 1
# drop0() removes stored entries whose absolute value is <= tol
m <- drop0(m, tol = threshold)

Which seems to work well.
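
One caveat: drop0() compares absolute values against tol, so it matches the m < threshold rule from the question only when the matrix has no negative entries. A small sketch with made-up data that shows the difference:

```r
library(Matrix)

# Made-up 3 x 3 example with one negative entry.
m <- sparseMatrix(i = c(1, 2, 3, 3), j = c(1, 2, 3, 1),
                  x = c(0.5, 2, -3, 0.2), dims = c(3, 3))
threshold <- 1

# drop0() keeps every entry whose absolute value exceeds the tolerance,
# so the -3 survives.
by_abs <- drop0(m, tol = threshold)

# Filtering on the signed value, as in the question, removes the -3 as well.
dense <- as.matrix(m)
dense[dense < threshold] <- 0
by_value <- Matrix(dense, sparse = TRUE)

nnzero(by_abs)    # 2 stored nonzeros: 2 and -3
nnzero(by_value)  # 1 stored nonzero: 2
```

For an all-positive matrix like the one in the question, the two rules differ only for entries exactly equal to the threshold.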