I have Python code for parsing an XML file, as detailed here. I understand that XML files are notorious for hogging system resources when manipulated in memory. My solution works for smaller XML files (around 200 KB), but the file I need to process is 340 MB.
I started researching a StAX (pull parser) implementation, but I am on a tight schedule and am looking for a much simpler approach to this task.
I understand the idea of splitting the file into smaller chunks, but how do I extract the right elements while repeating the main/header tags in each output file?
For instance, this is the schema:
<?xml version="1.0" ?>
<!--Sample XML Document-->
<bookstore>
<book Id="1">
....
....
</book>
<book Id="2">
....
....
</book>
<book Id="3">
....
....
</book>
....
....
....
<book Id="n">
....
....
</book>
</bookstore>
How do I create a new XML file, including the header data, for every 1000 book elements? For a concrete example of the code and data set, please refer to my other question here. Thanks a lot.
All I want is to avoid loading the entire dataset into memory at once. Can the XML file be parsed in a streaming fashion? Am I thinking along the right lines?
P.S.: My situation is similar to a question asked in 2009. I will post an answer here once I find a simpler solution to my problem. Your feedback is appreciated.
You can parse your big XML file incrementally:
You can use xml.etree.ElementTree.iterparse and discard each book element after it is processed, so only a bounded number of elements are ever held in memory.
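A minimal sketch of that approach, assuming the flat schema shown in the question (a bookstore root with book children); the function name split_books and the output file names books_partN.xml are my own illustrative choices, not anything from your code:

```python
import xml.etree.ElementTree as ET

def split_books(source, chunk_size=1000):
    """Stream `source` and write a well-formed <bookstore> file
    for every `chunk_size` <book> elements."""
    part = 1
    books = []

    def flush():
        nonlocal part
        # Rebuild the header/root element so every chunk is a
        # complete, well-formed <bookstore> document on its own.
        root = ET.Element("bookstore")
        root.extend(books)
        ET.ElementTree(root).write(f"books_part{part}.xml",
                                   encoding="utf-8", xml_declaration=True)
        part += 1
        books.clear()

    context = ET.iterparse(source, events=("start", "end"))
    _, tree_root = next(context)       # the "start" event of <bookstore>
    for event, elem in context:
        # "end" fires once a <book> element has been fully parsed
        if event == "end" and elem.tag == "book":
            books.append(elem)
            tree_root.remove(elem)     # detach it so the in-memory tree stays small
            if len(books) == chunk_size:
                flush()
    if books:                          # leftover books go in a final, smaller file
        flush()
```

The key point is the tree_root.remove(elem) call: iterparse still builds a tree as it goes, so without detaching each processed book the root would grow to hold the whole 340 MB document anyway. Memory use is then bounded by roughly one chunk of books at a time.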