Workaround to R memory leak with XML package

Posted 2019-01-20 11:39

Question:

I am trying to run a simple program that extracts tables from HTML. However, readHTMLTable in the XML package seems to leak memory. Is there an easy way to work around this, for example by allocating a dedicated block of memory for this call and then freeing it manually?

I have tried wrapping the call in a function, calling gc(), and using different versions of R and of the package, and nothing seems to work. I am getting desperate.

Example code. How can I run this without memory usage exploding?

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
while(TRUE) {
    b = readHTMLTable(a)
    #do something with b
}

Edit: Something like this still takes all of my memory:

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
f <- function(x) {
    b = readHTMLTable(x)
    rm(x)
    gc()
    return(b)
}

for(i in 1:100) {
    d = f(a)
    rm(d)
    gc()
}
rm(list=ls())
gc()

I am using Windows 7 and have tried both 32-bit and 64-bit.

Answer 1:

As of XML 3.98-1.4 and R 3.1 on Windows 7, this problem can be solved by calling free() on the parsed document. However, it does not work with readHTMLTable(). The following code runs without the memory growth:

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
while(TRUE){
   b = xmlParse(paste(a, collapse = ""))
   #do something with b
   free(b)
}
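
If the tables themselves are needed, one variant that may be worth trying is to parse the document explicitly with htmlParse(), pass the parsed document to readHTMLTable(), and free() it afterwards. This is only a sketch: the helper name read_tables_and_free and the inline HTML are made up for illustration, and whether this pattern avoids the leak on a given XML/R version is an assumption, not something confirmed in this thread.

```r
library(XML)

# Hypothetical helper (not part of the XML package): parse the HTML ourselves
# so that we hold a document handle that can be passed to free().
read_tables_and_free <- function(html_text) {
  doc <- htmlParse(paste(html_text, collapse = "\n"), asText = TRUE)
  # Release the C-level document when the function exits, even on error.
  on.exit(free(doc), add = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Small made-up page so the example does not need network access.
html <- c(
  "<html><body><table>",
  "<tr><th>team</th><th>goals</th></tr>",
  "<tr><td>Germany</td><td>18</td></tr>",
  "<tr><td>Argentina</td><td>8</td></tr>",
  "</table></body></html>"
)

for (i in 1:100) {
  tables <- read_tables_and_free(html)
  # do something with tables
}
```

The data frames returned by readHTMLTable() are ordinary R objects, so they should remain usable after the document has been freed.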

The xml2 package has similar issues, and there the memory can be released by calling xml_remove() followed by gc().



Answer 2:

I had a lot of problems with memory leaks in the XML package too (under both Windows and Linux). The way I eventually solved it was to remove the object at the end of each processing step, i.e. to add rm(b) and gc() at the end of each iteration. Let me know if this works for you too.



Answer 3:

Same problem here. Even when doing nothing more than reading in the document with doc <- xmlParse(...); root <- xmlRoot(doc), the memory allocated to doc is never released to the OS (as monitored in the Windows Task Manager).

A crazy idea that we might try is to use system("Rscript ...") to perform the XML parsing in a separate R session, save the parsed R object to a file, and then read that file back into the main R session. It is hacky, but it would at least ensure that whatever memory the XML parsing gobbles up is released when the Rscript session terminates, so it does not affect the main process.
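
The separate-session idea could be sketched roughly as follows. This is an illustration under assumptions rather than a tested fix: the helper name read_tables_isolated, the temporary file layout, and the choice of saveRDS()/readRDS() for passing the result back are all invented here, and the sketch assumes that Rscript sits in R.home("bin") and that the XML package is installed for the child session.

```r
# Hypothetical helper: run readHTMLTable() in a throwaway R process so that
# any memory it leaks is returned to the OS when that process exits.
read_tables_isolated <- function(html_file) {
  out_file <- tempfile(fileext = ".rds")
  script   <- tempfile(fileext = ".R")
  on.exit(unlink(c(script, out_file)), add = TRUE)

  # Child script: parse the tables and serialise them to the output file.
  writeLines(c(
    "args <- commandArgs(trailingOnly = TRUE)",
    "library(XML)",
    "tables <- readHTMLTable(args[1], stringsAsFactors = FALSE)",
    "saveRDS(tables, args[2])"
  ), script)

  rscript <- file.path(R.home("bin"), "Rscript")
  status <- system2(rscript,
                    c(shQuote(script), shQuote(html_file), shQuote(out_file)))
  if (status != 0) stop("child R session failed with status ", status)

  # Only plain R objects (data frames) cross the process boundary.
  readRDS(out_file)
}

# Usage with a small made-up local file; a downloaded page would work the same way.
page <- tempfile(fileext = ".html")
writeLines(paste0(
  "<html><body><table><tr><th>team</th></tr>",
  "<tr><td>Germany</td></tr></table></body></html>"
), page)
tables <- read_tables_isolated(page)
```

Starting a new R session for every call is slow, so in practice it probably makes sense to let each child session handle a batch of documents rather than a single one.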