biglm predict unable to allocate a vector of size

Posted 2019-07-14 06:11

Question:

I have this code:

library(biglm)
library(ff)

myData <- read.csv.ffdf(file = "myFile.csv")   # training data as an on-disk ffdf
testData <- read.csv(file = "test.csv")        # test data as an in-memory data frame
form <- dependent ~ .
model <- biglm(form, data = myData)
predictedData <- predict(model, newdata = testData)

The model is created without problems, but when I make the prediction it runs out of memory:

unable to allocate a vector of size xx.x MB

Any hints? Or how can I use ff to reserve memory for the predictedData variable?

Answer 1:

I have not used the biglm package before. Based on what you said, you ran out of memory when calling predict, and your new dataset has nearly 7,000,000 rows.

To resolve the memory issue, prediction must be done chunk-wise: for example, iteratively predicting 20,000 rows at a time. I am not sure whether predict.bigglm can do chunk-wise prediction, but you can roll your own loop, as sketched below.
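
For a concrete picture, here is a minimal sketch of such a loop. It assumes model and testData from the question, and the chunk size of 20,000 is arbitrary; predict on a biglm object may return a one-column matrix, hence the as.vector().

chunk_size <- 20000
n <- nrow(testData)
predictedData <- numeric(n)              # pre-allocate the full result once
for (start in seq(1, n, by = chunk_size)) {
  end <- min(start + chunk_size - 1, n)
  chunk <- testData[start:end, , drop = FALSE]
  predictedData[start:end] <- as.vector(predict(model, newdata = chunk))
}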

Why not have a look at the mgcv package? bam can fit linear models / generalized linear models / generalized additive models, etc., on large data sets. Like biglm, it performs chunk-wise matrix factorization when fitting the model. But predict.bam supports chunk-wise prediction, which is really useful in your case. Furthermore, it does parallel model fitting and prediction, backed by the parallel package [use the cluster argument of bam(); see the examples under ?bam and ?predict.bam for parallel examples].

Just do library(mgcv), and check ?bam and ?predict.bam.
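
For a rough idea, a minimal sketch along those lines. The file names, the response dependent, and the predictors x1 and x2 are placeholders; note that bam keeps the raw data frame in memory but builds the model matrix in chunks, and predict.bam processes newdata in chunks of block.size rows.

library(mgcv)
library(parallel)

trainData <- read.csv("myFile.csv")   # assumes the raw data fits in memory
testData  <- read.csv("test.csv")

cl <- makeCluster(2)                  # parallel backend for bam
# placeholder formula: replace dependent, x1, x2 with your actual columns
fit <- bam(dependent ~ x1 + x2, data = trainData, cluster = cl)

# chunk-wise (and parallel) prediction, block.size rows per chunk
pred <- predict(fit, newdata = testData, block.size = 20000, cluster = cl)
stopCluster(cl)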


Remark

Do not use the nthreads argument for parallelism; it is not useful for parametric regression.



Answer 2:

Here are the possible causes and solutions:

  1. Cause: You're using 32-bit R

    Solution: Use 64-bit R

  2. Cause: You're just plain out of RAM

    Solution: Allocate more RAM if you can (?memory.limit). If you can't, then consider using ff, working in chunks, running gc(), or at worst scaling up by leveraging a cloud. Chunking is often the key to success with Big Data -- try doing the predictions 10% at a time, saving the results to disk after each chunk and removing the in-memory objects after use (see the sketch after this list).

  3. Cause: There's a bug in your code leaking memory

    Solution: Fix the bug. This doesn't look like your case; however, make sure your data is of the expected size, and keep an eye on your resource monitor to make sure nothing funny is going on.
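
To make point 2 concrete, a minimal sketch of the 10%-at-a-time approach. It assumes model and testData from the question; the pred_chunk file names are placeholders.

n <- nrow(testData)
breaks <- floor(seq(0, n, length.out = 11))     # boundaries for 10 chunks
for (i in seq_len(10)) {
  idx <- (breaks[i] + 1):breaks[i + 1]
  chunk_pred <- predict(model, newdata = testData[idx, , drop = FALSE])
  saveRDS(chunk_pred, sprintf("pred_chunk_%02d.rds", i))  # save results to disk
  rm(chunk_pred); gc()                          # drop the in-memory copy
}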



Answer 3:

I've tried biglm and mgcv, but memory and factor problems came up quickly. I have had some success with the h2o library.
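
For reference, a minimal sketch of that route (not part of the original answer; the file and column names are placeholders). h2o holds the data in its own JVM-backed store, so R's memory limit matters less.

library(h2o)
h2o.init()

train <- h2o.importFile("myFile.csv")   # data lives in the H2O cluster, not in R
test  <- h2o.importFile("test.csv")

# plain linear model via h2o.glm; "dependent" is a placeholder response name,
# and with x omitted all remaining columns are used as predictors
fit  <- h2o.glm(y = "dependent", training_frame = train)
pred <- h2o.predict(fit, newdata = test)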