I have a matrix 60 000 x 60 000 in a txt file, I need to get svd of this matrix. I use R but I don´t know if R can generate it.
问题:
回答1:
I think it's possible to compute (partial) svd
using the irlba
package and bigmemory
and bigalgebra
without using a lot of memory.
First let's create a 20000 * 20000 matrix and save it into a file
require(bigmemory)
require(bigalgebra)
require(irlba)
con <- file("mat.txt", open = "a")
replicate(20, {
x <- matrix(rnorm(1000 * 20000), nrow = 1000)
write.table(x, file = 'mat.txt', append = TRUE,
row.names = FALSE, col.names = FALSE)
})
file.info("mat.txt")$size
## [1] 7.264e+09 7.3 Gb
close(con)
Then you can read this matrix using bigmemory::read.big.matrix
bigm <- read.big.matrix("mat.txt", sep = " ",
type = "double",
backingfile = "mat.bk",
backingpath = "/tmp",
descriptorfile = "mat.desc")
str(bigm)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slots
## ..@ address:<externalptr>
dim(bigm)
## [1] 20000 20000
bigm[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] -0.3623255 -0.58463 -0.23172
## [2,] -0.0011427 0.62771 0.73589
## [3,] -0.1440494 -0.59673 -1.66319
Now we can use the use the excellent irlba
package as explained in the package vignette.
The first step consist of defining matrix multiplication operator which can work with big.matrix
object and then use the irlba::irlba
function
### vignette("irlba", package = "irlba") # for more info
matmul <- function(A, B, transpose=FALSE) {
## Bigalgebra requires matrix/vector arguments
if(is.null(dim(B))) B <- cbind(B)
if(transpose)
return(cbind((t(B) %*% A)[]))
cbind((A %*% B)[])
}
dim(bigm)
system.time(
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
)
## user system elapsed
## 169.820 0.923 170.194
str(S)
## List of 5
## $ d : num [1:2] 283 283
## $ u : num [1:20000, 1:2] -0.00615 -0.00753 -0.00301 -0.00615 0.00734 ...
## $ v : num [1:20000, 1:2] 0.020086 0.012503 0.001065 -0.000607 -0.006009 ...
## $ iter : num 10
## $ mprod: num 310
I forgot to set the seed to make it reproductible but I just wanted to show that it's possible to do that in R.
EDIT
If you are using a new version of the package irlba
, the above code throw an error because the matmult
parameter of the function irlba
has been renamed to mult
. Therefore, you should change this part of the code
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
By
S <- irlba(bigm, nu = 2, nv = 2, mult = matmul)
I want to thank @FrankD for pointing this out.
回答2:
In R 3.x+ you can construct a matrix of that size, the upper limit of vector sizes being 2^53 (or maybe 2^53-1 ), up from 2^31-1 as it was before which was why Andrie was throwing an error on his out-of-date installation. It generally takes 10 bytes per numeric element. At any rate:
> 2^53 < 10*60000^2
[1] FALSE # so you are safe on that account.
It would also fit in 64GB (but not in 32GB):
> 64000000000 < 10*60000^2
[1] FALSE
Generally to do any serious work you need at least 3 times the size of your largest object, so this seems pretty borderline even with the new expanded vectors/matrices.