This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans and reduces the size and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10mb.
It's then supposed to saved as an RDS file on AWS s3 where it's retrieved in a production environment to build predictions.
e.g.
db.connection <- db.connection.object
build_model_list <- function(db.connection) {
clean_and_build_models <- function(db.connection, other.parameters) {
get_db_data <- function(db.connection, some.parameters) {# Retrieve db data} ## Externally defined
db.data <- get_db_data()
build_models <- function(db.data, some.parameters) ## Externally defined
clean_data <- function(db.data, some.parameters) {# Cleans and filters data based on parameters} ## Externally defined
clean.data <- clean_data()
lm_model <- function(clean.data) {# Builds lm model based on clean.data} ## Externally defined
lm.model <- lm_model()
return(list(lm.model, other.parameters))} ## Externally defined
looped.model.object <- llply(some.parameters, clean_and_build_models)
return(looped.model.object)}
model.list <- build_model_list()
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that 'model.list' object which is only 10MB in memory will inflate to many GBs when I save locally as RDS or try to upload to AWS s3.
I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly it's trial and error and follows no clear logical order, with frequent crashes and a couple of hours taken to build the list object, it's a very long debugging cycle.
Secondly, it'll mean my main function will be many hundreds of lines long which will make future alterations and debugging much more tricky.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.