Tricks to manage the available memory in an R sess

2018-12-31 04:20发布

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occassionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.

Any other nice tricks folks want to share? One per post, please.

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

25条回答
其实,你不懂
2楼-- · 2018-12-31 05:15

Saw this on a twitter post and think it's an awesome function by Dirk! Following on from JD Long's answer, I would do this for user friendly reading:

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
                           format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

Which results in something like the following:

                      Type   Size PrettySize Length/Rows Columns
pca.res                 PCA 790128   771.6 Kb          7      NA
DF               data.frame 271040   264.7 Kb        669      50
factor.AgeGender   factanal  12888    12.6 Kb         12      NA
dates            data.frame   9016     8.8 Kb        669       2
sd.                 numeric   3808     3.7 Kb         51      NA
napply             function   2256     2.2 Kb         NA      NA
lsos               function   1944     1.9 Kb         NA      NA
load               loadings   1768     1.7 Kb         12       2
ind.sup             integer    448  448 bytes        102      NA
x                 character     96   96 bytes          1      NA

NOTE: The main part I added was (again, adapted from JD's answer) :

obj.prettysize <- napply(names, function(x) {
                           print(object.size(x), units = "auto") })
查看更多
有味是清欢
3楼-- · 2018-12-31 05:15

Unfortunately I did not have time to test it extensively but here is a memory tip that I have not seen before. For me the required memory was reduced with more than 50%. When you read stuff into R with for example read.csv they require a certain amount of memory. After this you can save them with save("Destinationfile",list=ls()) The next time you open R you can use load("Destinationfile") Now the memory usage might have decreased. It would be nice if anyone could confirm whether this produces similar results with a different dataset.

查看更多
人间绝色
4楼-- · 2018-12-31 05:16

You also can get some benefit using knitr and puting your script in Rmd chuncks.

I usually divide the code in different chunks and select which one will save a checkpoint to cache or to a RDS file, and

Over there you can set a chunk to be saved to "cache", or you can decide to run or not a particular chunk. In this way, in a first run you can process only "part 1", another execution you can select only "part 2", etc.

Example:

part1
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter)  # build the corpus
```
part2
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```

As a side effect, this also could save you some headaches in terms of reproducibility :)

查看更多
永恒的永恒
5楼-- · 2018-12-31 05:17

The use of environments instead of lists to handle collections of objects which occupy a significant amount of working memory.

The reason: each time an element of a list structure is modified, the whole list is temporarily duplicated. This becomes an issue if the storage requirement of the list is about half the available working memory, because then data has to be swapped to the slow hard disk. Environments, on the other hand, aren't subject to this behaviour and they can be treated similar to lists.

Here is an example:

get.data <- function(x)
{
  # get some data based on x
  return(paste("data from",x))
}

collect.data <- function(i,x,env)
{
  # get some data
  data <- get.data(x[[i]])
  # store data into environment
  element.name <- paste("V",i,sep="")
  env[[element.name]] <- data
  return(NULL)  
}

better.list <- new.env()
filenames <- c("file1","file2","file3")
lapply(seq_along(filenames),collect.data,x=filenames,env=better.list)

# read/write access
print(better.list[["V1"]])
better.list[["V2"]] <- "testdata"
# number of list elements
length(ls(better.list))

In conjunction with structures such as big.matrix or data.table which allow for altering their content in-place, very efficient memory usage can be achieved.

查看更多
人气声优
6楼-- · 2018-12-31 05:17

Based on @Dirk's and @Tony's answer I have made a slight update. The result was outputting [1] before the pretty size values, so I took out the capture.output which solved the problem:

.ls.objects <- function (pos = 1, pattern, order.by,
                     decreasing=FALSE, head=FALSE, n=5) {
napply <- function(names, fn) sapply(names, function(x)
    fn(get(x, pos = pos)))
names <- ls(pos = pos, pattern = pattern)
obj.class <- napply(names, function(x) as.character(class(x))[1])
obj.mode <- napply(names, mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.prettysize <- napply(names, function(x) {
    format(utils::object.size(x),  units = "auto") })
obj.size <- napply(names, utils::object.size)

obj.dim <- t(napply(names, function(x)
    as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing=decreasing), ]
if (head)
    out <- head(out, n)

return(out)
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()
查看更多
美炸的是我
7楼-- · 2018-12-31 05:19

For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.

dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) { 
        print( i, "; flushing to disk..." )
        write.table( dfinal, file=tempfile, append=!first, col.names=first )
        first <- FALSE
        dfinal <- NULL   # nuke it
    }

    # ... complex operations here that add data to 'dfinal' data frame  
}
print( "Loop done; flushing to disk and re-reading entire data set..." )
write.table( dfinal, file=tempfile, append=TRUE, col.names=FALSE )
dfinal <- read.table( tempfile )
查看更多
登录 后发表回答