svd of very large matrix in R program

2019-03-14 02:44发布

问题:

I have a matrix 60 000 x 60 000 in a txt file, I need to get svd of this matrix. I use R but I don´t know if R can generate it.

回答1:

I think it's possible to compute (partial) svd using the irlba package and bigmemory and bigalgebra without using a lot of memory.

First let's create a 20000 * 20000 matrix and save it into a file

require(bigmemory)
require(bigalgebra)
require(irlba)

con <- file("mat.txt", open = "a")
replicate(20, {
    x <- matrix(rnorm(1000 * 20000), nrow = 1000)
    write.table(x, file  = 'mat.txt', append = TRUE,
            row.names = FALSE, col.names = FALSE)
})

file.info("mat.txt")$size
## [1] 7.264e+09   7.3 Gb
close(con)

Then you can read this matrix using bigmemory::read.big.matrix

bigm <- read.big.matrix("mat.txt", sep = " ",
                        type = "double",
                        backingfile = "mat.bk",
                        backingpath = "/tmp",
                        descriptorfile = "mat.desc")

str(bigm)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slots
##   ..@ address:<externalptr>

dim(bigm)
## [1] 20000 20000

bigm[1:3, 1:3]
##            [,1]     [,2]     [,3]
## [1,] -0.3623255 -0.58463 -0.23172
## [2,] -0.0011427  0.62771  0.73589
## [3,] -0.1440494 -0.59673 -1.66319

Now we can use the use the excellent irlba package as explained in the package vignette.

The first step consist of defining matrix multiplication operator which can work with big.matrix object and then use the irlba::irlba function

### vignette("irlba", package = "irlba") # for more info

matmul <- function(A, B, transpose=FALSE) {
    ## Bigalgebra requires matrix/vector arguments
    if(is.null(dim(B))) B <- cbind(B)

    if(transpose)
        return(cbind((t(B) %*% A)[]))

    cbind((A %*% B)[])
}

dim(bigm)

system.time(
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
)

##    user  system elapsed 
## 169.820   0.923 170.194


str(S)
## List of 5
##  $ d    : num [1:2] 283 283
##  $ u    : num [1:20000, 1:2] -0.00615 -0.00753 -0.00301 -0.00615 0.00734 ...
##  $ v    : num [1:20000, 1:2] 0.020086 0.012503 0.001065 -0.000607 -0.006009 ...
##  $ iter : num 10
##  $ mprod: num 310

I forgot to set the seed to make it reproductible but I just wanted to show that it's possible to do that in R.

EDIT

If you are using a new version of the package irlba, the above code throw an error because the matmult parameter of the function irlba has been renamed to mult. Therefore, you should change this part of the code

S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)

By

S <- irlba(bigm, nu = 2, nv = 2, mult = matmul)

I want to thank @FrankD for pointing this out.



回答2:

In R 3.x+ you can construct a matrix of that size, the upper limit of vector sizes being 2^53 (or maybe 2^53-1 ), up from 2^31-1 as it was before which was why Andrie was throwing an error on his out-of-date installation. It generally takes 10 bytes per numeric element. At any rate:

> 2^53 < 10*60000^2
[1] FALSE  # so you are safe on that account.

It would also fit in 64GB (but not in 32GB):

> 64000000000 < 10*60000^2
[1] FALSE

Generally to do any serious work you need at least 3 times the size of your largest object, so this seems pretty borderline even with the new expanded vectors/matrices.