What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm()
some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.
Any other nice tricks folks want to share? One per post, please.
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
        fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
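For example, just calling the helpers defined above:

lsos()        # the ten largest objects in the workspace, by size (bytes)
lsos(n = 5)   # only the five largest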
I quite like the improved .ls.objects() function from Dirk's answer. Much of the time, though, a more basic output with just the object name and size is sufficient for me. Here's a simpler function with a similar objective: the output can be sorted alphabetically or by size, limited to a certain number of objects, and ordered ascending or descending. Also, I often work with data that are 1 GB+, so the function changes units accordingly.
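A minimal sketch along the lines described (the function name showMemoryUse and its arguments are illustrative, not the original code):

# Sketch: names and human-readable sizes of objects in the global environment,
# sorted by size or by name, optionally limited to the first n entries.
showMemoryUse <- function(sort = "size", decreasing = TRUE, n = NULL) {
    objNames <- ls(envir = .GlobalEnv)
    if (length(objNames) == 0L) return(invisible(NULL))
    sizes <- lapply(objNames, function(x) object.size(get(x, envir = .GlobalEnv)))
    bytes <- vapply(sizes, as.numeric, numeric(1))
    pretty <- vapply(sizes, format, character(1), units = "auto")  # KB/MB/GB as appropriate
    out <- data.frame(object = objNames, size = pretty, stringsAsFactors = FALSE)
    ord <- if (sort == "name") order(objNames, decreasing = decreasing) else order(bytes, decreasing = decreasing)
    out <- out[ord, ]
    if (!is.null(n)) out <- head(out, n)
    rownames(out) <- NULL
    out
}

# e.g. the ten largest objects:
# showMemoryUse(sort = "size", decreasing = TRUE, n = 10)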
I'm fortunate in that my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32-bit binary). Thus I can do the pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.
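A rough sketch of that chunk-wise workflow, where readChunk() and downsample() are hypothetical stand-ins for the instrument-specific import and reduction steps (the file paths are made up too):

# Pre-process each ~100 MB chunk on its own, keep only the reduced result,
# then fuse the pieces at the end. readChunk() and downsample() are placeholders.
chunk.files <- list.files("raw", pattern = "\\.bin$", full.names = TRUE)
reduced <- vector("list", length(chunk.files))
for (i in seq_along(chunk.files)) {
    chunk <- readChunk(chunk.files[i])      # read one chunk (32-bit binary)
    reduced[[i]] <- downsample(chunk)       # drop uninformative parts / downsample
    rm(chunk)                               # the full-size chunk is no longer needed
    gc()
}
full.data <- do.call(rbind, reduced)        # fuse the reduced data set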
Calling gc() "by hand" can help if the size of the data gets close to the available memory.

Sometimes a different algorithm needs much less memory.
Sometimes there's a trade-off between vectorization and memory use: compare split & lapply vs. a for loop.

For the sake of fast & easy data analysis, I often work first with a small random subset (sample()) of the data. Once the data analysis script/.Rnw is finished, the data analysis code and the complete data go to the calculation server for an overnight / over-the-weekend / ... calculation.
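A sketch of those two points, assuming a large data frame d with a grouping column g and a measurement column value (all three names are made up for illustration):

d.small <- d[sample(nrow(d), 1e4), ]   # develop the analysis on a small random subset

# split() + lapply(): concise, but split() materialises every group at once,
# so d is effectively held in memory twice while the list exists.
res <- lapply(split(d, d$g), function(piece) mean(piece$value))

# for loop: more code, but only one group is live at any time.
groups <- unique(d$g)
res <- vector("list", length(groups))
for (i in seq_along(groups)) {
    piece <- d[d$g == groups[i], ]
    res[[i]] <- mean(piece$value)
    rm(piece)
}
names(res) <- groups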
I use the data.table package. With its := operator you can add columns by reference, modify subsets of existing columns by reference (and by group by reference), and delete columns by reference. None of these operations copy the (potentially large) data.table at all, not even once, so data.table uses much less working memory.

Related links: When should I use the := operator in data.table?
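A minimal sketch of those by-reference updates (the table DT and its columns x and grp are made up for illustration):

library(data.table)
DT <- data.table(x = rnorm(1e6), grp = sample(letters, 1e6, replace = TRUE))

DT[, y := x * 2]                      # add a column by reference
DT[x < 0, y := 0]                     # modify a subset of an existing column by reference
DT[, grp.mean := mean(x), by = grp]   # modify by group, by reference
DT[, y := NULL]                       # delete a column by reference
# none of these lines copies DT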
If you really want to avoid the leaks, you should avoid creating any big objects in the global environment.

What I usually do is to have a function that does the job and returns NULL; all data is read and manipulated in this function or in others that it calls.
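A small sketch of that pattern (the file names and the aggregation step are placeholders; it assumes the input has a group column and numeric columns):

# The big object only ever exists inside the function's environment,
# so it can be garbage-collected as soon as the call returns.
process_file <- function(in.path, out.path) {
    big <- read.csv(in.path)                                  # large intermediate data
    result <- aggregate(. ~ group, data = big, FUN = mean)    # assumes a 'group' column
    write.csv(result, out.path, row.names = FALSE)
    invisible(NULL)                                           # nothing big is returned
}
process_file("big_input.csv", "summary_by_group.csv")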
Just to note that the data.table package's tables() seems to be a pretty good replacement for Dirk's .ls.objects() custom function (detailed in earlier answers), although it only covers data.frames/tables and not e.g. matrices, arrays, or lists.

I never save an R workspace. I use import scripts and data scripts, and output to files any especially large data objects that I don't want to recreate often. This way I always start with a fresh workspace and don't need to clean out large objects. That is a very nice function though.
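A sketch of that separation using saveRDS()/readRDS() (the file names and the cleaning step are illustrative):

# import script: create the expensive object once and write it to disk
raw <- read.csv("raw_measurements.csv")
clean <- na.omit(raw)                         # whatever cleaning is needed
saveRDS(clean, "clean_measurements.rds")

# analysis script: start from an empty workspace and load only what is needed
clean <- readRDS("clean_measurements.rds")
# ... analysis ...
# the workspace itself is never saved (no save.image()), so the next session starts fresh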