I have this code:
library(biglm)
library(ff)
myData <- read.csv.ffdf(file = "myFile.csv")
testData <- read.csv(file = "test.csv")
form <- dependent ~ .
model <- biglm(form, data=myData)
predictedData <- predict(model, newdata=testData)
The model is created without problems, but when I make the prediction, R runs out of memory:
unable to allocate a vector of size xx.x MB
Any hints?
Or how can I use ff to reserve memory for the predictedData variable?
I have not used the biglm package before. Based on what you said, you ran out of memory when calling predict, and you have nearly 7,000,000 rows in the new dataset.
To resolve the memory issue, prediction must be done chunk-wise. For example, you iteratively predict 20,000 rows at a time. I am not sure whether predict.bigglm can do chunk-wise prediction.
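A minimal sketch of what chunk-wise prediction could look like if you do it by hand. The helper name predict_in_chunks and its chunk_size argument are made up for illustration, and the demonstration uses lm() on the built-in mtcars data as a stand-in; whether predict on a biglm fit behaves the same way is an assumption I have not verified.

```r
# Hypothetical helper: predict in fixed-size row blocks so that only one
# block's model matrix is built in memory at a time.
predict_in_chunks <- function(model, newdata, chunk_size = 20000L) {
  n <- nrow(newdata)
  if (n == 0L) return(numeric(0))
  out <- numeric(n)                      # preallocate the result once
  for (s in seq(1L, n, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1L, n) # row indices of the current block
    out[idx] <- as.numeric(predict(model, newdata = newdata[idx, , drop = FALSE]))
  }
  out
}

# Demonstration with lm() (a stand-in, not biglm) on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
chunked <- predict_in_chunks(fit, mtcars, chunk_size = 10L)
full <- as.numeric(predict(fit, newdata = mtcars))
```

Subsetting a data frame keeps the factor levels of each column, so every block sees the same levels as the full test set. The result vector itself still has to fit in memory; if it does not, write each block to disk instead of storing it.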
Why not have a look at the mgcv package? bam can fit linear models, generalized linear models, generalized additive models, etc., for large data sets. Similar to biglm, it performs chunk-wise matrix factorization when fitting the model. But predict.bam supports chunk-wise prediction, which is really useful for your case. Furthermore, it does parallel model fitting and model prediction, backed by the parallel package [use the cluster argument of bam(); see the examples under ?bam and ?predict.bam].
Just do library(mgcv), and check ?bam and ?predict.bam.
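Roughly, the workflow could look like the sketch below. The data are simulated purely for illustration (the column names x1, x2 and y are made up), and the chunk.size and block.size values are arbitrary; tune them to your memory budget.

```r
library(mgcv)

set.seed(1)
n <- 5000
# Simulated training data, for illustration only
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(n)

# Fit a plain linear model; bam() processes the data in blocks of chunk.size rows
fit <- bam(y ~ x1 + x2, data = train, chunk.size = 1000)

# Simulated "new" data to predict on
test <- data.frame(x1 = rnorm(2000), x2 = rnorm(2000))

# block.size caps how many rows predict.bam handles per internal call
pred <- as.numeric(predict(fit, newdata = test, block.size = 500))
```

A smaller block.size lowers peak memory use at the cost of speed.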
Remark
Do not use the nthreads argument for parallelism. That is not useful for parametric regression.
Here are the possible causes and solutions:
Cause: You're using 32-bit R
Solution: Use 64-bit R
Cause: You're just plain out of RAM
Solution: Allocate more RAM if you can (?memory.limit). If you can't, then consider using ff, working in chunks, running gc(), or at worst scaling up by leveraging a cloud. Chunking is often the key to success with Big Data -- try doing the predictions 10% at a time, saving the results to disk after each chunk and removing the in-memory objects after use.
Cause: There's a bug in your code leaking memory
Solution: Fix the bug. This doesn't look like your case; however, make sure that your data is of the expected size, and keep an eye on your resource monitor to make sure nothing funny is going on.
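To make the chunking advice concrete, here is a rough sketch that reads the test CSV block by block through an open connection, predicts each block, and appends the predictions to a file on disk. The function name predict_csv_in_chunks, its arguments and the temporary file names are all hypothetical, and lm() on mtcars stands in for the real model. It assumes the column types seen in the first block hold for the rest of the file.

```r
# Hypothetical helper: stream a CSV through predict() and write results to disk.
predict_csv_in_chunks <- function(model, in_file, out_file, chunk_size = 20000L) {
  con <- file(in_file, open = "r")
  on.exit(close(con), add = TRUE)

  # First block: read the header and learn column names and classes
  chunk <- read.csv(con, nrows = chunk_size, header = TRUE)
  col_names <- names(chunk)
  col_classes <- vapply(chunk, function(col) class(col)[1], character(1))
  # Treat integer columns as numeric so later decimal values still parse
  col_classes[col_classes == "integer"] <- "numeric"

  first <- TRUE
  total <- 0L
  while (!is.null(chunk) && nrow(chunk) > 0L) {
    pred <- data.frame(prediction = as.numeric(predict(model, newdata = chunk)))
    # Write the header only once, then append
    write.table(pred, out_file, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
    total <- total + nrow(chunk)
    first <- FALSE
    rm(pred)
    # An exhausted connection yields either an error or a zero-row data
    # frame, depending on the R version; both cases end the loop
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE,
               col.names = col_names, colClasses = col_classes),
      error = function(e) NULL
    )
  }
  invisible(total)
}

# Demonstration with lm() on a built-in dataset written to a temporary CSV
fit <- lm(mpg ~ wt + hp, data = mtcars)
in_file <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write.csv(mtcars, in_file, row.names = FALSE)
n_done <- predict_csv_in_chunks(fit, in_file, out_file, chunk_size = 10L)
result <- read.csv(out_file)
```

Only one block of test rows and its predictions are in memory at any time. Factor predictors need extra care, since a block may not contain every level seen during fitting.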
I've tried biglm and mgcv, but ran into memory and factor problems quickly. I have had some success with the h2o library.