Parsing XML with Python xml.sax: How does one “kee

I need to regularly export XML files from our Administration software.

This is the first time I'm using XML Parsing in Python. The XML with xml.sax isn't terribly difficult, but what is the best way to "keep track" of where in the XML tree you are?

For example, I have a list of our customers. I want to extract the Telephone by , but there are multiple places where occurs:

eExact -> Accounts -> Account -> Contacts -> Contact -> Addresses -> Address  -> Phone
eExact -> Accounts -> Account -> Contacts -> Contact -> Phone
eExact -> Accounts -> Account -> Phone

So I need to keep to keep track of where exactly in the XML tree I am in order to get the right Phone number

As far as I can figure out from the xml.sax documentation at the Python website, there is no "easy" method or variable that is set.

So, this is what I did:

import xml.sax

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    self.curpath.append(name)

    if name == 'Phone':
      print self.curpath, name

  def endElement(self, name):
    self.curpath.pop()

if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = Exact()
  parser.setContentHandler(handler)
  parser.parse(open('/home/cronuser/xml/mount/daily/debtors.xml'))

This is not very difficult, but since I don't have a lot of experience with XML I wonder if this the "generally accepted" or "best possible" manner?

Thanks :)

标签： python xml sax

4条回答

叼着烟拽天下

2楼-- · 2020-07-24 03:49

Thanks or all the comments.

I looked at ElementTree's iterparse, but by then I already did quite some code in xml.sax. Since the immediate advantage of iterparse is small to nonexistent I choose to just stick with xml.sax. It's already a large advantage over the current solution.

Alright, so this is what I did in the end.

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self, stdpath):
        self.stdpath = stdpath

        self.thisrow = {}
        self.curpath = []
        self.getvalue = None

        self.conn = MySQLConnect()
        self.table = None
        self.numrows = 0

    def __del__(self):
        self.conn.close()

        print '%s rows affected' % self.numrows

    def startElement(self, name, att):
        self.curpath.append(name)

    def characters(self, data):
        if self.getValue is not None:
            self.thisrow[self.getValue.strip()] = data.strip()
            self.getValue = None

    def endElement(self, name):
        self.curpath.pop()

        if name == self.stdpath[len(self.stdpath) - 1]:
            self.EndRow()
            self.thisrow = { }

    def EndRow(self):
        self.numrows += MySQLInsert(self.conn, self.thisrow, True, self.table)
        #for k, v in self.thisrow.iteritems():
        #   print '%s: %s,' % (k, v),
        #print ''

    def curPath(self, full=False):
        if full:
            return ' > '.join(self.curpath)
        else:
            return ' > '.join(self.curpath).replace(' > '.join(self.stdpath) + ' > ', '')

I then subclass this a number of times for different XML files:

class Debtors(sqlimport.Exact):
    def startDocument(self):
        self.table = 'debiteuren'
        self.address = None

    def startElement(self, name, att):
        sqlimport.Exact.startElement(self, name, att)

        if self.curPath(True) == ' > '.join(self.stdpath):
            self.thisrow = {}
            self.thisrow['debiteur'] = att.get('code').strip()
        elif self.curPath() == 'Name':
            self.getValue = 'naam'
        elif self.curPath() == 'Phone':
            self.getValue = 'telefoon1'
        elif self.curPath() == 'ExtPhone':
            self.getValue = 'telefoon2'
        elif self.curPath() == 'Contacts > Contact > Addresses > Address':
            if att.get('type') == 'V':
                self.address = 'Contacts > Contact > Addresses > Address '
        elif self.address is not None:
            if self.curPath() == self.address + '> AddressLine1':
                self.getValue = 'adres1'
            elif self.curPath() == self.address + '> AddressLine2':
                self.getValue = 'adres2'
        else:
            self.getValue = None

if __name__ == '__main__':
    handler = Debtors(['Debtors', 'Accounts', 'Account'])
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    parser.parse(open('myfile.xml', 'rb'))

... and so forth ...

0人赞添加讨论(0) 举报

戒情不戒烟

3楼-- · 2020-07-24 03:59

I used sax too, but then I found a better tool: iterparse from ElementTree.

It is similar to sax, but you can retrieve elements with contents, to free the memory you just have to clear element once retrieved.

0人赞添加讨论(0) 举报

▲ chillily

4楼-- · 2020-07-24 03:59

I think the simplest solution is exactly what you are doing in your example - maintain a stack of nodes.

0人赞添加讨论(0) 举报

对你真心纯属浪费

5楼-- · 2020-07-24 04:08

Is there a specific reason you need to use SAX for this?

Because, if loading the entire XML file into an in-memory object model is acceptable, you'd probably find it a lot easier to use the ElementTree DOM API.

(If you don't need the ability to retrieve a parent node when given a child, cElementTree in the Python standard library should do the trick nicely. If you do, the LXML library provides an ElementTree implementation that'll give you parent references. Both use compiled C modules for speed.)

0人赞添加讨论(0) 举报

Parsing XML with Python xml.sax: How does one “kee

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间