I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file.
My code is basically this:
outf = open('out.txt', 'w')
context = iterparse('copyright.xml')
context = iter(context)
dummy, root = context.next()
for event, elem in context:
    if elem.tag == 'foo':
        author = elem.text
    elif elem.tag == 'bar':
        if elem.text is not None and 'bat' in elem.text.lower():
            outf.write(elem.text + '\n')
    elem.clear() #line A
    root.clear() #line B
My question is two-fold:
First - Do I need both A and B (see code snippet comments)? I was told that root.clear() clears unnecessary children so memory isn't devoured, but here are my observations: using B and not A is the same as using neither in terms of memory consumption (plotted with task manager). Using only A seems to be the same as using both.
Second - Why is this still consuming so much memory? As the program runs, it uses about 100 MB of RAM near the end.
I assume it has something to do with outf, but why? Isn't it just writing to disk? And if it is storing that data before outf closes, how can I avoid that?
Other information: I am using Python 2.7.3 on Windows.
(The code as posted, with the second line indented, should not run.) http://bugs.python.org/issue14762 was a similar issue and the answer there is that you should clear each element (line A). Without seeing what outf is (or the code that created it), it is hard to answer the second question. If it were a StringIO object, the answer would be obvious. You might take a look at the tutorial linked in the second message of the tracker issue:
http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
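As a rough illustration of the "clear each element" advice, here is a minimal sketch using the standard-library xml.etree.ElementTree. The function name extract_matches, the needle parameter, and the sample document are hypothetical, made up for this example. One detail worth noting: iterparse reports only 'end' events by default, so the first item it yields is the first completed element rather than the document root. The sketch asks for 'start' events as well, so that the first event really does deliver the root. This may be why line B appeared to have no effect, though that is an assumption about the original setup.

```python
import io
import xml.etree.ElementTree as ET


def extract_matches(source, out, needle='bat'):
    """Stream `source`, write matching <bar> texts to `out`, return the count.

    Hypothetical helper for illustration; tag names follow the question.
    """
    # Ask for 'start' events too, so the first event carries the real root.
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)
    count = 0
    for event, elem in context:
        if event != 'end':
            continue  # an element is only complete at its 'end' event
        if (elem.tag == 'bar' and elem.text is not None
                and needle in elem.text.lower()):
            out.write(elem.text + '\n')
            count += 1
        # Line A: release this element's text, attributes and children.
        elem.clear()
        # Line B: drop the (now empty) children the root still references.
        root.clear()
    return count


if __name__ == '__main__':
    # Tiny made-up document standing in for the large file.
    sample = (b"<root><rec><foo>Ann</foo><bar>Batman</bar></rec>"
              b"<rec><foo>Bob</foo><bar>nothing</bar></rec></root>")
    sink = io.StringIO()
    print(extract_matches(io.BytesIO(sample), sink))
```

Passing the output file as a parameter keeps the sketch testable. With a real file opened via open('out.txt', 'w'), the writes go through a small buffer and are flushed to disk as it fills, so the output file is unlikely to account for the memory growth.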
Use xml.etree.cElementTree.iterparse() instead [in Python 2.x]. Life's too short to debug other people's bugs.
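A small sketch of how that import is commonly written so the same script also runs on newer interpreters: xml.etree.cElementTree exists in Python 2.x but was removed in Python 3.9, where xml.etree.ElementTree already uses the C accelerator when it is available. The alias ET is just a convention chosen for this example.

```python
try:
    # Python 2.x (and Python 3 before 3.9): explicit C implementation.
    import xml.etree.cElementTree as ET
except ImportError:
    # Newer Python 3: the plain module picks up the C accelerator itself.
    import xml.etree.ElementTree as ET

# Either way, the streaming parser is available under the same name.
iterparse = ET.iterparse
```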