The code below eventually consumes all my available memory, and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference:
    import lxml.etree

    # iterparse() yields (event, element) pairs
    for event, schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print("why does this consume all my memory?")
What am I doing wrong / how can I process this large file with iterparse()? I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.
This worked really well for me:
As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory. In order to free some memory as you parse, use Liza Daly's fast_iter:
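Here is a sketch of that fast_iter (lightly updated for Python 3):

    def fast_iter(context, func, *args, **kwargs):
        """
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context

which you could then use like this (process_element and its description XPath are illustrative; put your real per-element work there):

    import lxml.etree as etree

    def process_element(elem):
        print(elem.xpath('description/text()'))

    context = etree.iterparse('really-big-file.xml', tag='schedule')
    fast_iter(context, process_element)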
I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, and thus saves more memory. Here you'll find a script which demonstrates the difference.

Directly copied from http://effbot.org/zone/element-iterparse.htm:
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
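In sketch form (effbot's example uses record as the tag name, so substitute your own; lxml.etree.iterparse behaves the same way as the stdlib version here):

    from lxml import etree

    for event, elem in etree.iterparse('really-big-file.xml'):
        if elem.tag == 'record':
            # ... process the record element here ...
            elem.clear()  # discard the element's children once processed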
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
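A sketch of that workaround, again with record as the placeholder tag:

    from lxml import etree

    # get an iterable over (event, element) pairs and turn it into an iterator
    context = iter(etree.iterparse('really-big-file.xml', events=('start', 'end')))

    # the first start event delivers the root element
    event, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag == 'record':
            # ... process the record element here ...
            root.clear()  # clearing the root keeps the in-memory tree empty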