I am a Python newbie. I am trying to parse a huge XML file in my Python module using lxml. In spite of clearing the elements at the end of each loop, my memory shoots up and crashes the application. I am sure I am missing something here. Please help me figure out what that is.
Following are the main functions I am using:
```python
from lxml import etree

def parseXml(context, attribList):
    for _, element in context:
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib, '')

def readAllChildren(element, fieldMap, attribList, rowList):
    for childElem in element:
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
            readAllChildren(childElem, fieldMap, attribList, rowList)
        rowList.append(fieldMap.copy())
        childElem.clear()

def main():
    attribList = ['name', 'age', 'id']
    context = etree.iterparse(fullFilePath, events=("start",))
    for row in parseXml(context, attribList):
        print row
```
Thanks!!
Example xml and the nested dictionary -
```xml
<root xmlns='NS'>
  <Employee Name="Mr.ZZ" Age="30">
    <Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
      <Employment id="1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
        <Project Name="ABC_1" Team="4">
        </Project>
      </Employment>
      <Employment id="2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
        <PromotionStatus>Manager</PromotionStatus>
        <Project Name="XYZ_1" Team="7">
          <Award>Star Team Member</Award>
        </Project>
      </Employment>
    </Experience>
  </Employee>
</root>
```
```python
ELEMENT_NAME = 'element_name'
ELEMENTS = 'elements'
ATTRIBUTES = 'attributes'
TEXT = 'text'

xmlDef = {'namespace': 'NS',
          'content':
              {ELEMENT_NAME: 'Employee',
               ELEMENTS: [{ELEMENT_NAME: 'Experience',
                           ELEMENTS: [{ELEMENT_NAME: 'Employment',
                                       ELEMENTS: [{ELEMENT_NAME: 'PromotionStatus',
                                                   ELEMENTS: [],
                                                   ATTRIBUTES: [],
                                                   TEXT: ['PromotionStatus']},
                                                  {ELEMENT_NAME: 'Project',
                                                   ELEMENTS: [{ELEMENT_NAME: 'Award',
                                                               ELEMENTS: [],
                                                               ATTRIBUTES: [],
                                                               TEXT: ['Award']}],
                                                   ATTRIBUTES: ['Name', 'Team'],
                                                   TEXT: []}],
                                       ATTRIBUTES: ['id', 'EndTime', 'StartDate', 'EndDate'],
                                       TEXT: []}],
                           ATTRIBUTES: ['TotalYears', 'StartDate', 'EndDate'],
                           TEXT: []}],
               ATTRIBUTES: ['Name', 'Age'],
               TEXT: []}}
```
Welcome to Python and Stack Overflow!
It looks like you've followed some good advice in looking at lxml, and especially etree.iterparse(..), but I think your implementation is approaching the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead to process tags as they get read in. Your readAllChildren(..) function is saving everything to rowList, which grows and grows to cover the whole document tree. I made a few changes to show what's going on when it runs with some dummy data:
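A minimal instrumented sketch along those lines (the depth parameter, the prints, and the tiny dummy document are my additions for illustration, not your original code):

```python
import io
from lxml import etree

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib, '')

def readAllChildren(element, fieldMap, attribList, rowList, depth=1):
    for childElem in element:
        # Show how deep we are and how large rowList has already grown.
        print('%sreading <%s>, rowList has %d rows' % ('  ' * depth, childElem.tag, len(rowList)))
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
            readAllChildren(childElem, fieldMap, attribList, rowList, depth + 1)
        rowList.append(fieldMap.copy())
        childElem.clear()

def parseXml(context, attribList):
    for _, element in context:
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        print('<%s>: collected %d rows before the first yield' % (element.tag, len(rowList)))
        for row in rowList:
            yield row
        element.clear()

# Dummy data: everything hangs off <root>, so the very first 'start' event
# triggers a walk over the entire document.
dummy = b'<root><a name="x"><b name="y"/></a><a name="z"/></root>'
context = etree.iterparse(io.BytesIO(dummy), events=('start',))
rows = list(parseXml(context, ['name']))
```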
It's a little hard to read, but you can see it's climbing the whole tree down from the root tag on the first pass, building up rowList for every element in the entire document. You'll also notice it's not even stopping there: since the element.clear() call comes after the yield statement in parseXml(..), it doesn't get executed until the second iteration (i.e. the next element in the tree).

Incremental processing FTW
A simple fix is to let iterparse(..) do its job: parse iteratively! The following will pull the same information and process it incrementally instead, running on the same dummy XML:
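A sketch of that incremental version (the cleanup loop with getprevious() is a common lxml idiom for freeing already-processed siblings; the dummy document is mine):

```python
import io
from lxml import etree

def parseXml(context, attribList):
    for _, element in context:
        # One row per element, built from that element alone -- no recursion,
        # no shared fieldMap, and no list holding every row in the document.
        yield dict((attrib, element.get(attrib, '')) for attrib in attribList)
        # An 'end' event fires only once the element is complete, so it is
        # safe to clear it (and drop finished siblings) straight away.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

dummy = b'<root><Employee name="Mr.ZZ" age="30" id="1"/></root>'
context = etree.iterparse(io.BytesIO(dummy), events=('end',))
rows = list(parseXml(context, ['name', 'age', 'id']))
print(rows)
```

Note how the root element still produces a row of empty strings here, which is what the next point is about.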
This should greatly improve both the speed and memory performance of your script. Also, by hooking the 'end' event, you're free to clear and delete elements as you go, rather than waiting until all children have been processed.

Depending on your dataset, it might be a good idea to only process certain types of elements. The root element, for one, probably isn't very meaningful, and other nested elements may also fill your dataset with a lot of {'age': u'', 'id': u'', 'name': u''}.

Or, with SAX
As an aside, when I read "XML" and "low-memory" my mind always jumps straight to SAX, which is another way you could attack this problem. Using the builtin xml.sax module:

You'll have to evaluate both options based on what works best in your situation (and maybe run a couple of benchmarks, if this is something you'll be doing often).
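A minimal sketch of that xml.sax approach (the handler class name and the collect-into-a-list demo are my own choices, not a prescribed pattern):

```python
import io
import xml.sax

class AttribHandler(xml.sax.ContentHandler):
    """Emit one row of interesting attributes per element as it is parsed."""
    def __init__(self, attribList):
        xml.sax.ContentHandler.__init__(self)
        self.attribList = attribList
        # Collected here only for the demo; a real handler would process
        # each row immediately and discard it to stay low-memory.
        self.rows = []

    def startElement(self, name, attrs):
        self.rows.append(dict((a, attrs.get(a, '')) for a in self.attribList))

handler = AttribHandler(['name', 'age', 'id'])
dummy = b'<root><Employee name="Mr.ZZ" age="30" id="1"/></root>'
xml.sax.parse(io.BytesIO(dummy), handler)
print(handler.rows)
```

SAX never builds a tree at all, so there is nothing to clear() afterwards; the trade-off is that you have to track any cross-element state yourself.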
Be sure to follow up with how things work out!
Edit based on follow-up comments
Implementing either of the above solutions may require some changes to the overall structure of your code, but anything you have now should still be doable. For instance, to process "rows" in batches, you could have:
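For example, a small batching wrapper that works with any of the row generators above (the helper name and batch size are illustrative):

```python
from itertools import islice

def batchRows(rows, batchSize):
    """Yield lists of up to batchSize rows from any row generator."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batchSize))
        if not batch:
            break
        yield batch

# e.g. commit to a database once per batch instead of once per row:
batches = list(batchRows(({'id': i} for i in range(5)), 2))
print([len(b) for b in batches])  # -> [2, 2, 1]
```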