I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file.
My code is basically this:
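from elementtree.ElementTree import iterparse

outf = open('out.txt', 'w')
context = iterparse('copyright.xml')
context = iter(context)
# grab the first element reported by the parser and treat it as the root
dummy, root = context.next()

for event, elem in context:
    if elem.tag == 'foo':
        author = elem.text
    elif elem.tag == 'bar':
        # keep only the <bar> entries whose text mentions 'bat'
        if elem.text is not None and 'bat' in elem.text.lower():
            outf.write(elem.text + '\n')
    elem.clear()  # line A
    root.clear()  # line B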
My question is two-fold:
First - Do I need both A and B (see the comments in the code snippet)? I was told that root.clear() discards children that are no longer needed so that memory isn't devoured, but here is what I observe: using B without A consumes as much memory as using neither (plotted with Task Manager), while using only A seems to be the same as using both.
Second - Why is this still consuming so much memory? By the time the program nears the end of the file, it is using about 100 MB of RAM.
I assume it has something to do with outf, but why? Isn't it just writing to disk? And if it is storing that data before outf closes, how can I avoid that?
Other information: I am using Python 2.7.3 on Windows.
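
For comparison, here is a minimal sketch of a variant that requests 'start' events explicitly, so that the first item pulled from the iterator is the document root (with the default settings, iterparse reports only 'end' events). It is an illustration rather than a measured fix. It assumes the standard-library xml.etree.ElementTree instead of the standalone elementtree package, and extract_matching, its write parameter, and the small in-memory sample document are hypothetical names introduced only for this example.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib module, not the standalone elementtree package


def extract_matching(source, write):
    """Pass the text of every <bar> that mentions 'bat' to write(); return how many matched."""
    # Ask for 'start' events as well, so the first item yielded is the root's start event.
    context = iter(ET.iterparse(source, events=("start", "end")))
    event, root = next(context)

    matched = 0
    for event, elem in context:
        if event != "end":
            continue  # an element's text is only complete once its 'end' event arrives
        if elem.tag == "bar" and elem.text is not None and "bat" in elem.text.lower():
            write(elem.text + "\n")
            matched += 1
        elem.clear()  # line A: drop this element's own text, attributes and children
        root.clear()  # line B: drop the root's references to already-processed children
    return matched


# Hypothetical stand-in for copyright.xml, small enough to keep in memory.
sample = io.BytesIO(
    b"<root><foo>someone</foo><bar>Batman</bar><bar>other</bar><bar>comBAT</bar></root>"
)
lines = []
count = extract_matching(sample, lines.append)
```

In the real script, outf.write would be passed in place of lines.append; the list is used here only so the example runs without touching the disk.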