Is there a fast XML parser in Python that allows m

I am working with potentially huge XML files containing complex trace information from one of my projects.

I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.

If I have created a "shelve" index that contains information like "books for author Joe are at offsets [22322, 35446, 54545]", then I can just open the XML file like a regular text file, seek to those offsets, and hand the data to one of the DOM parsers that take a file or a string.
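A minimal sketch of that lookup side, assuming Python 3, a UTF-8 file, and subsections that do not depend on namespace prefixes or entities declared elsewhere in the document. The helper name `parse_element_at`, the file name `trace.xml`, and the sample `<library>` document are made up for illustration, and the byte scan that produces the offsets is only a stand-in for the real indexing pass:

```python
import os
import re
import shelve
import tempfile
import xml.etree.ElementTree as ET

def parse_element_at(f, offset, chunk_size=4096):
    """Parse the single element that starts at byte `offset` of binary file `f`."""
    f.seek(offset)
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            raise ValueError("element at offset %d is never closed" % offset)
        parser.feed(chunk)
        for event, elem in parser.read_events():
            depth += 1 if event == "start" else -1
            if depth == 0:
                # The element that started at `offset` has just closed; stop here
                # so that whatever follows it in the file is never looked at.
                return elem

xml_bytes = (b"<library>\n"
             b"  <book author='Joe'><title>A</title></book>\n"
             b"  <book author='Ann'><title>B</title></book>\n"
             b"  <book author='Joe'><title>C</title></book>\n"
             b"</library>\n")

with tempfile.TemporaryDirectory() as tmp:
    xml_path = os.path.join(tmp, "trace.xml")
    with open(xml_path, "wb") as out:
        out.write(xml_bytes)

    # Stand-in for the indexing pass: find the start offsets with a plain byte scan.
    joe_offsets = [m.start() for m in re.finditer(rb"<book author='Joe'", xml_bytes)]

    index_path = os.path.join(tmp, "index")
    with shelve.open(index_path) as index:
        index["books for author Joe"] = joe_offsets

    # Later: look the key up, seek to each offset, parse only that subsection.
    with shelve.open(index_path) as index, open(xml_path, "rb") as f:
        titles = [parse_element_at(f, off).findtext("title")
                  for off in index["books for author Joe"]]

print(titles)
```

Opening the file in binary mode matters here: the stored offsets are byte offsets, and `seek()` on a text-mode file does not accept arbitrary positions.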

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need is a fast SAX parser that lets me find the start offset of tags in the file together with the start events. Then I can parse a subsection of the XML together with its starting point in the document, extract the key information, and store the key and offset in the shelve index.

Tags: python xml parsing indexing sax

1 answer

The star

Reply #2 · 2019-04-10 13:08

Since locators return line and column numbers rather than offsets, you need a little wrapping to track line ends. Here is a simplified example (it could have some off-by-one errors ;-) ):

import io
import re
from xml import sax
from xml.sax import handler

# Match line ends in the raw bytes, so the results are byte offsets usable with seek()
relinend = re.compile(rb'\n')

txt = b'''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = io.BytesIO(txt)

class LocatingWrapper(object):
    def __init__(self, f):
        self.f = f
        self.linelocs = [0]  # offset at which each line starts; line 1 starts at 0
        self.curoffs = 0     # number of bytes handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        # m.end() is one past the newline, i.e. where the next line starts
        linends = (m.end() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linends)
        self.curoffs += len(data)
        return data

    def close(self):
        # the parser closes its input stream when it is done
        self.f.close()

    def where(self, loc):
        # line numbers are 1-based, column numbers are 0-based
        return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc
    def startElement(self, name, attrs):
        print('%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc)))

sax.parse(locstm, Handler())

Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.
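A rough sketch of that memory-saving variant, written for Python 3 and assuming the locator is only ever queried for the current or a later line (which holds for a forward-only SAX parse). The names `PruningLocatingWrapper` and `OffsetCollector` are made up for this illustration:

```python
import io
import re
from xml import sax
from xml.sax import handler

class PruningLocatingWrapper:
    """Tracks line starts like the wrapper above, but forgets lines already passed."""
    _lineend = re.compile(rb"\n")

    def __init__(self, f):
        self.f = f
        self.linestarts = {1: 0}  # line number -> byte offset where that line starts
        self.firstline = 1        # lowest line number still kept in the dict
        self.lastline = 1         # highest line number seen so far
        self.curoffs = 0          # number of bytes handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        for m in self._lineend.finditer(data):
            self.lastline += 1
            self.linestarts[self.lastline] = self.curoffs + m.end()
        self.curoffs += len(data)
        return data

    def close(self):
        # The parser closes its input stream when it is done.
        self.f.close()

    def where(self, loc):
        line = loc.getLineNumber()
        # Drop every line below the one being queried; it will not be needed again.
        for old in range(self.firstline, line):
            del self.linestarts[old]
        self.firstline = max(self.firstline, line)
        return self.linestarts[line] + loc.getColumnNumber()

class OffsetCollector(handler.ContentHandler):
    def __init__(self, wrapper):
        super().__init__()
        self.wrapper = wrapper
        self.offsets = []  # (tag name, byte offset of its '<')

    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        self.offsets.append((name, self.wrapper.where(self.loc)))

txt = b"<foo>\n  <tit>Bar</tit>\n  <baz>whatever</baz>\n</foo>"
wrapper = PruningLocatingWrapper(io.BytesIO(txt))
collector = OffsetCollector(wrapper)
sax.parse(wrapper, collector)
print(collector.offsets)
```

The dict can still grow to one parser read buffer's worth of lines between queries, since the parser reads ahead in chunks, but it no longer grows with the size of the whole file.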
