Why is elementtree.ElementTree.iterparse using so

I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file.

My code is basically this:

outf = open('out.txt', 'w') 
context = iterparse('copyright.xml')
context = iter(context)
dummy, root = context.next()

for event, elem in context:
    if elem.tag == 'foo':
        author = elem.text

    elif elem.tag == 'bar':
        if elem.text is not None and 'bat' in elem.text.lower():
            outf.write(elem.text + '\n')
    elem.clear()   #line A
    root.clear()   #line B

My question is two-fold:

First - Do I need both A and B (see code snippet comments)? I was told that root.clear() clears unnecessary children so memory isn't devoured, but here are my observations: using B and not A is the same as using neither in terms of memory consumption (plotted with task manager). Using only A seems to be the same as using both.

Second - Why is this still consuming so much memory? As the program runs, it uses about 100 MB of RAM near the end.

I assume it has something to do with outf, but why? Isn't it just writing to disk? And if it is storing that data before outf closes, how can I avoid that?

Other information: I am using Python 2.7.3 on Windows.

Tags: python xml memory elementtree iterparse

2 Answers

成全新的幸福

#2 · 2019-02-20 15:21

(The code as posted, with the second line indented, should not run.) http://bugs.python.org/issue14762 was a similar issue and the answer there is that you should clear each element (line A). Without seeing what outf is (or the code that created it), it is hard to answer the second question. If it were a StringIO object, the answer would be obvious. You might take a look at the tutorial linked in the second message of the tracker issue:

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
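A minimal sketch of the "clear each element" pattern, written against the standard-library xml.etree.ElementTree rather than the standalone elementtree package from the question. The function name find_matching_text and its parameters are hypothetical, chosen only for illustration. One point that may explain why line B appeared to do nothing: with the default events, iterparse reports only "end" events, so the first item it yields is the first element to finish, not the document root. Requesting "start" events as well makes the first item carry the real root.

```python
import io
import xml.etree.ElementTree as ET


def find_matching_text(source, tag, needle):
    """Yield the text of every <tag> element whose text contains `needle`.

    `source` may be a filename or a binary file object. Hypothetical helper,
    not part of any library.
    """
    # Ask for "start" events too, so the first event carries the root element.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)

    for event, elem in context:
        if event != "end":
            continue  # an element is only complete at its "end" event
        if elem.tag == tag and elem.text and needle in elem.text.lower():
            yield elem.text
        elem.clear()  # line A: drop this element's text and children
        root.clear()  # line B: drop the references the root still holds


if __name__ == "__main__":
    sample = (b"<root><rec><foo>A</foo><bar>Bat one</bar></rec>"
              b"<rec><bar>none</bar></rec></root>")
    for text in find_matching_text(io.BytesIO(sample), "bar", "bat"):
        print(text)
```

Because the function is a generator, the caller can write each match to the output file as it arrives, so nothing beyond the current element has to stay in memory.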


甜甜的少女心

#3 · 2019-02-20 15:31

Use xml.etree.cElementTree.iterparse() instead [in Python 2.x].

Life's too short to debug other people's bugs.
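A small sketch of how that swap is often written so the same script still imports on interpreters where cElementTree is unavailable (it was removed in Python 3.9, and from Python 3.3 on the plain ElementTree module uses the C accelerator automatically). The helper count_tags is hypothetical and only there to show that iterparse is called the same way under either import.

```python
import io

try:
    # C implementation, the relevant one on Python 2.x
    import xml.etree.cElementTree as ET
except ImportError:
    # Newer Python 3 releases no longer ship cElementTree
    import xml.etree.ElementTree as ET


def count_tags(source, tag):
    """Count elements named `tag`, clearing each element once it is complete."""
    count = 0
    for _, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == tag:
            count += 1
        elem.clear()
    return count


if __name__ == "__main__":
    print(count_tags(io.BytesIO(b"<r><bar/><foo/><bar/></r>"), "bar"))
```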

