Consider the following example in R:
x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- t(g) %*% g
gginv <- solve(gg)
bigmatrix <- outer(x1, x2, "<=")  ## a 100,000 x 100,000 matrix
Gw <- t(g) %*% bigmatrix
beta <- gginv %*% Gw
w1 <- bigmatrix - g %*% beta
If I try to run this on my computer, it throws a memory error (because bigmatrix is too big).
Do you know how I can achieve the same result without running into this problem?
This is a least squares problem with 100,000 responses. Your bigmatrix is the response (matrix), beta is the coefficient (matrix), while w1 is the residual (matrix). bigmatrix and w1 are each 100,000 x 100,000; formed explicitly and stored as doubles, each will cost 100,000 * 100,000 * 8 bytes, about 74.5 GB. This is far too large.
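As a quick sanity check of that figure (assuming 8-byte doubles, which is what the matrix products will coerce the logical entries to):

1e5 * 1e5 * 8 / 1024^3
## [1] 74.50581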
As estimation for each response is independent, there is really no need to form bigmatrix in one go and store it all in RAM. We can form it tile by tile and use an iterative procedure: form a tile, use it, then discard it. For example, the sketch below considers tiles of dimension 100,000 x 2,000, each costing 100,000 * 2,000 * 8 bytes, about 1.5 GB. With such an iterative procedure, memory usage is effectively kept under control.
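Here is a minimal sketch of that tiled loop. The tile width, the while-loop bookkeeping, and the use of crossprod are my own illustrative choices, not a reference implementation; it reuses x1, x2, g, and gginv from above and fills beta 2,000 columns at a time:

n <- length(x2)
tile.size <- 2000                         ## one 100,000 x 2,000 tile of doubles ~ 1.5 GB
beta <- matrix(0, ncol(g), n)             ## 4 x 100,000 coefficient matrix
start <- 1
while (start <= n) {
  end <- min(start + tile.size - 1, n)
  tile <- outer(x1, x2[start:end], "<=")             ## form one tile of bigmatrix
  beta[, start:end] <- gginv %*% crossprod(g, tile)  ## crossprod(g, tile) == t(g) %*% tile
  start <- end + 1                                   ## the old tile is discarded next pass
}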
Note: don't try to compute the full residual matrix w1, as it will also cost 74.5 GB. If you need the residuals in later work, break them into tiles as well and process them one by one, as in the sketch below. And don't worry about the loop here: the computation inside each iteration is costly enough to amortize the looping overhead.
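A hedged sketch of that tiled residual processing: each pass re-forms a tile, computes the corresponding columns of w1, consumes them, and discards the tile. What you do with each residual tile depends on your later work; reducing it to a per-response residual sum of squares here is purely illustrative. It reuses n, tile.size, g, and beta from the previous sketch:

rss <- numeric(n)                         ## e.g. residual sum of squares per response
start <- 1
while (start <= n) {
  end <- min(start + tile.size - 1, n)
  tile <- outer(x1, x2[start:end], "<=")
  resid.tile <- tile - g %*% beta[, start:end]  ## columns start:end of w1
  rss[start:end] <- colSums(resid.tile ^ 2)     ## consume the tile, then let it go
  start <- end + 1
}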