I am trying to run a simple program to extract tables from HTML. However, there seems to be a memory issue with readHTMLTable in the XML package. Is there an easy way to work around this, for example by setting aside dedicated memory for this call and then freeing it manually?
I have tried wrapping this in a function, calling gc(), and using different versions of R and of the package, but nothing seems to work. I am starting to get desperate.
Example code. How can I run this without memory usage exploding?
library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
while(TRUE) {
  b = readHTMLTable(a)
  #do something with b
}
Edit: Something like this still takes all of my memory:
library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
f <- function(x) {
  b = readHTMLTable(x)
  rm(x)
  gc()
  return(b)
}
for(i in 1:100) {
  d = f(a)
  rm(d)
  gc()
}
rm(list=ls())
gc()
I am using Windows 7 and have tried both 32-bit and 64-bit.
Same problem here. Even when doing nothing more than reading in the document with doc <- xmlParse(...); root <- xmlRoot(doc), the memory allocated to doc is never released to the OS (as monitored in Windows' Task Manager).
A crazy idea that we might try is to use system("Rscript ...") to perform the XML parsing in a separate R session, saving the parsed R object to a file, which we then read back in the main R session. Hacky, but it would at least ensure that whatever memory is gobbled up by the XML parsing is released when the Rscript session terminates and does not affect the main process!
I had a lot of problems with memory leaks in the XML package too (under both Windows and Linux). The way I eventually solved it was to remove the object at the end of each processing step, i.e. add an rm(b) and a gc() at the end of each iteration. Let me know if this works for you too.
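The separate-session idea above could be sketched roughly as follows. This is only an illustration under assumptions: run_in_child() is a hypothetical helper name, the child code is expected to leave its output in a variable called result, and saveRDS()/readRDS() are one possible way of passing the object back. The XML-specific call is shown as a comment so that the demonstration itself runs offline with base R only.

```r
# Hypothetical helper: run some R code in a throwaway Rscript process and
# return the object it stored in a variable named `result`.
run_in_child <- function(code_lines) {
  script <- tempfile(fileext = ".R")
  out    <- tempfile(fileext = ".rds")
  on.exit(unlink(c(script, out)), add = TRUE)

  # The child script runs the supplied code, then saves `result` to disk.
  # deparse() quotes and escapes the path (including Windows backslashes).
  writeLines(c(code_lines, sprintf("saveRDS(result, %s)", deparse(out))), script)

  rscript <- file.path(R.home("bin"), "Rscript")
  status  <- system2(rscript, c("--vanilla", shQuote(script)))
  if (status != 0) stop("child R session failed")

  # Any memory the child used is gone once it exits; only the saved
  # object is loaded into the main session.
  readRDS(out)
}

# In the setting of the question, the child code might look like this:
# tables <- run_in_child(c(
#   "library(XML)",
#   "a <- readLines('http://en.wikipedia.org/wiki/2014_FIFA_World_Cup')",
#   "result <- readHTMLTable(a)"
# ))

# Offline demonstration using base R only:
demo <- run_in_child("result <- data.frame(x = 1:3)")
```

Starting a new R process for every parse is slow, so in a loop it would probably make sense to let one child process handle a whole batch of pages and return a list.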
As of XML 3.98-1.4 and R 3.1 on Windows 7, this problem can be solved by calling free() on the parsed document. However, it does not work with readHTMLTable(). The following code works perfectly.
The xml2 package has similar issues; there the memory can be released by calling xml_remove() followed by gc().
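As a rough sketch of the free() approach for the XML package, the loop below parses the page explicitly, copies what it needs into ordinary R values, and then frees the document. It is not the code from the answer above. The inline HTML string, the XPath expression and the variable names are assumptions made so the example runs offline, and it does not call readHTMLTable(), since the answer reports that free() does not help there.

```r
library(XML)

# Small inline page so the sketch runs offline; the question reads a
# Wikipedia page with readLines() instead.
html <- paste0(
  "<html><body><table>",
  "<tr><th>team</th></tr><tr><td>Germany</td></tr>",
  "</table></body></html>"
)

cells <- character(0)
for (i in 1:100) {
  # Parse into an internal (C-level) document.
  doc <- htmlParse(html, asText = TRUE)

  # Copy the needed values out as plain R character data, so that no
  # node objects still refer to the document when it is freed.
  cells <- xpathSApply(doc, "//table//td", xmlValue)

  # Explicitly release the document's memory, then drop the stale handle.
  free(doc)
  rm(doc)
}
gc()
```

The important detail is that nothing keeps a reference to doc or its nodes after free() is called; only the copied character vector survives the iteration.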