I have Python code for parsing an XML file, as detailed here. I understand that XML files are notorious for hogging system resources when manipulated in memory. My solution works for smaller XML files (around 200 KB), but the file I need to process is 340 MB.
I started researching a StAX (pull parser) implementation, but I am on a tight schedule and am looking for a much simpler approach to this task.
I understand the idea of splitting the file into smaller chunks, but how do I extract the right elements while repeating the main/header tags in each output file?
For instance, this is the schema:
<?xml version="1.0" ?>
<!--Sample XML Document-->
<bookstore>
<book Id="1">
....
....
</book>
<book Id="2">
....
....
</book>
<book Id="3">
....
....
</book>
....
....
....
<book Id="n">
....
....
</book>
</bookstore>
How do I create a new XML file, including the header data, for every 1000 book elements? For a concrete example of the code and data set, please refer to my other question here. Thanks a lot.
All I want is to avoid loading the entire dataset into memory at once. Can the XML file be parsed in a streaming fashion? Am I thinking along the right lines?
P.S.: My situation is similar to a question asked in 2009. I will post an answer here once I find a simpler solution to my problem. Your feedback is appreciated.
You can parse your big XML file incrementally:
You can use xml.etree.ElementTree.iterparse and discard each book element after it is processed, so only a bounded number of elements are ever held in memory.
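A minimal sketch of that approach, assuming the flat schema shown in the question (a bookstore root with book children); the function name split_books and the output file names books_partN.xml are my own illustrative choices, not anything from your code:

```python
import xml.etree.ElementTree as ET

def split_books(source, chunk_size=1000):
    """Stream `source` and write a well-formed <bookstore> file
    for every `chunk_size` <book> elements."""
    part = 1
    books = []

    def flush():
        nonlocal part
        # Rebuild the header/root element so every chunk is a
        # complete, well-formed <bookstore> document on its own.
        root = ET.Element("bookstore")
        root.extend(books)
        ET.ElementTree(root).write(f"books_part{part}.xml",
                                   encoding="utf-8", xml_declaration=True)
        part += 1
        books.clear()

    context = ET.iterparse(source, events=("start", "end"))
    _, tree_root = next(context)       # the "start" event of <bookstore>
    for event, elem in context:
        # "end" fires once a <book> element has been fully parsed
        if event == "end" and elem.tag == "book":
            books.append(elem)
            tree_root.remove(elem)     # detach it so the in-memory tree stays small
            if len(books) == chunk_size:
                flush()
    if books:                          # leftover books go in a final, smaller file
        flush()
```

The key point is the tree_root.remove(elem) call: iterparse still builds a tree as it goes, so without detaching each processed book the root would grow to hold the whole 340 MB document anyway. Memory use is then bounded by roughly one chunk of books at a time.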