I need to regularly export XML files from our Administration software.
This is the first time I'm using XML Parsing in Python. The XML with xml.sax
isn't terribly difficult, but what is the best way to "keep track" of where in the XML tree you are?
For example, I have a list of our customers. I want to extract the Telephone by , but there are multiple places where occurs:
eExact -> Accounts -> Account -> Contacts -> Contact -> Addresses -> Address -> Phone
eExact -> Accounts -> Account -> Contacts -> Contact -> Phone
eExact -> Accounts -> Account -> Phone
So I need to keep to keep track of where exactly in the XML tree I am in order to get the right Phone number
As far as I can figure out from the xml.sax documentation at the Python website, there is no "easy" method or variable that is set.
So, this is what I did:
import xml.sax
class Exact(xml.sax.handler.ContentHandler):
def __init__(self):
self.curpath = []
def startElement(self, name, attrs):
self.curpath.append(name)
if name == 'Phone':
print self.curpath, name
def endElement(self, name):
self.curpath.pop()
if __name__ == '__main__':
parser = xml.sax.make_parser()
handler = Exact()
parser.setContentHandler(handler)
parser.parse(open('/home/cronuser/xml/mount/daily/debtors.xml'))
This is not very difficult, but since I don't have a lot of experience with XML I wonder if this the "generally accepted" or "best possible" manner?
Thanks :)
Thanks or all the comments.
I looked at ElementTree's iterparse, but by then I already did quite some code in xml.sax. Since the immediate advantage of iterparse is small to nonexistent I choose to just stick with xml.sax. It's already a large advantage over the current solution.
Alright, so this is what I did in the end.
I then subclass this a number of times for different XML files:
... and so forth ...
I used sax too, but then I found a better tool: iterparse from ElementTree.
It is similar to sax, but you can retrieve elements with contents, to free the memory you just have to clear element once retrieved.
I think the simplest solution is exactly what you are doing in your example - maintain a stack of nodes.
Is there a specific reason you need to use SAX for this?
Because, if loading the entire XML file into an in-memory object model is acceptable, you'd probably find it a lot easier to use the ElementTree DOM API.
(If you don't need the ability to retrieve a parent node when given a child, cElementTree in the Python standard library should do the trick nicely. If you do, the LXML library provides an ElementTree implementation that'll give you parent references. Both use compiled C modules for speed.)