Non-Blocking method for parsing (streaming) XML in

Posted 2019-01-24 12:09

I have an XML document coming in over a socket that I need to parse and react to on the fly (i.e., parsing a partial tree). What I'd like is a non-blocking way of doing so, so that I can do other things while waiting for more data to come in (without threading).

Something like iterparse would be ideal if it finished iterating when the read buffer was empty, e.g.:

context = iterparse(imaginary_socket_file_wrapper)
while 1:
    for event, elem in context:
        process_elem(elem)
    # iteration of context finishes when socket has no more data
    do_other_stuff()
    time.sleep(0.1)

I guess SAX would also be an option, but iterparse just seems simpler for my needs. Any ideas?

Update:

Using threads is fine, but introduces a level of complexity that I was hoping to sidestep. I thought that non-blocking calls would be a good way to do so, but I'm finding that it increases the complexity of parsing the XML.

3 answers
小情绪 Triste *
#2 · 2019-01-24 12:54

Diving into the iterparse source provided the solution for me. Here's a simple example of building an XML tree on the fly and processing elements after their close tags:

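import xml.etree.ElementTree as etree

# Note: _parser and _end are private attributes of XMLTreeBuilder, so this
# depends on the ElementTree implementation in use.
parser = etree.XMLTreeBuilder()

def end_tag_event(tag):
    # Let the tree builder close the element, then handle the finished node.
    node = parser._end(tag)
    print(node)

# Replace the underlying expat end-tag handler with our own.
parser._parser.EndElementHandler = end_tag_event

def data_received(data):
    # Call this with each chunk of XML as it arrives.
    parser.feed(data)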

In my case I ended up feeding it data from Twisted, but it should also work with a non-blocking socket.
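For readers on Python 3.4 or later, the standard library's xml.etree.ElementTree.XMLPullParser offers the same feed-as-data-arrives pattern through public methods. The following is only a minimal sketch of that alternative, not the code from this answer; make_feeder, data_received, and finish are names made up for illustration.

```python
import xml.etree.ElementTree as ET

def make_feeder():
    """Return (data_received, finish) callables around one pull parser.

    Hypothetical helper for illustration only.
    """
    parser = ET.XMLPullParser(events=("end",))

    def data_received(data):
        # Feed one chunk; never blocks waiting for more input.
        parser.feed(data)
        # Return the elements whose closing tag has been parsed so far.
        return [elem for _event, elem in parser.read_events()]

    def finish():
        # Signal end of stream, then drain any events still queued.
        parser.close()
        return [elem for _event, elem in parser.read_events()]

    return data_received, finish

data_received, finish = make_feeder()
completed = []
# The chunk boundaries deliberately fall in the middle of text and tags.
for chunk in ("<log><entry id='1'>fi", "rst</entry><entry id='2'/>", "</log>"):
    completed.extend(data_received(chunk))
completed.extend(finish())

tags = [(elem.tag, elem.get("id")) for elem in completed]
```

Each call to data_received returns immediately with whatever elements have closed, so the caller can go on to other work between chunks.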

爷的心禁止访问
#3 · 2019-01-24 12:58

I think there are two components to this: non-blocking network I/O and a stream-oriented XML parser.

For the former, you'd have to pick a non-blocking network framework, or roll your own solution for this. Twisted certainly would work, but I personally find inversion of control frameworks difficult to wrap my brain around. You would likely have to keep track of a lot of state in your callbacks to feed the parser. For this reason I tend to find Eventlet a bit easier to program to, and I think it would fit well in this situation.

Essentially it allows you to write your code as if you were using a blocking socket call (using an ordinary loop or a generator or whatever you like), except that you can spawn it into a separate coroutine (a "greenlet") that will automatically perform a cooperative yield when I/O operations would block, thus allowing other coroutines to run.

This makes using any stream-oriented parser trivial again, because the code is structured like an ordinary blocking call. It also means that many libraries that don't directly deal with sockets or other I/O (like the parser for instance) don't have to be specially modified to be non-blocking: if they block, Eventlet yields the coroutine.

Admittedly Eventlet is slightly magic, but I find it has a much easier learning curve than Twisted, and results in more straightforward code because you don't have to turn your logic "inside out" to fit the framework.
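The shape of the idea can be sketched with the standard library's asyncio instead of Eventlet. This is a swapped-in technique, not Eventlet's API: the parsing loop reads like ordinary sequential code and gives up control only at each await. The names parse_stream, producer, and demo are hypothetical, and an asyncio.Queue stands in for the socket.

```python
import asyncio
import xml.etree.ElementTree as ET

async def parse_stream(chunks, on_element):
    """Consume chunks from a queue and report each completed element."""
    parser = ET.XMLPullParser(events=("end",))
    while True:
        # Suspends this coroutine until data arrives; other tasks run meanwhile.
        data = await chunks.get()
        if data is None:  # None is this sketch's end-of-stream marker
            break
        parser.feed(data)
        for _event, elem in parser.read_events():
            on_element(elem)
    parser.close()
    # Drain anything that only became available at end of stream.
    for _event, elem in parser.read_events():
        on_element(elem)

async def producer(chunks):
    """Stand-in for the network: delivers the document in pieces."""
    for piece in (b"<root><a>1</a>", b"<b>2</b>", b"</root>"):
        await chunks.put(piece)
        await asyncio.sleep(0)  # let the parser task run
    await chunks.put(None)

async def demo():
    chunks = asyncio.Queue()
    seen = []
    await asyncio.gather(
        parse_stream(chunks, lambda elem: seen.append(elem.tag)),
        producer(chunks),
    )
    return seen

seen_tags = asyncio.run(demo())
```

As with greenlets, no per-callback state has to be tracked by hand; the parser and loop variables simply live in the coroutine's local scope.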

男人必须洒脱
#4 · 2019-01-24 13:05

If you won't use threads, you can use an event loop and poll non-blocking sockets.

asyncore is the standard library module for such stuff. Twisted is the async library for Python, but complex and probably a bit heavyweight for your needs.
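A hand-rolled poll of a non-blocking socket might look roughly like the sketch below. It uses the selectors module rather than asyncore (asyncore was removed from the standard library in Python 3.12), and it assumes Python 3.4+ for XMLPullParser. poll_once is a hypothetical helper, and socket.socketpair() stands in for a real network connection.

```python
import selectors
import socket
import xml.etree.ElementTree as ET

def poll_once(sel, parser, timeout=0):
    """Read whatever is available right now; return the elements it completed."""
    done = []
    for key, _mask in sel.select(timeout):
        data = key.fileobj.recv(4096)
        if data:
            parser.feed(data)
        else:
            # Empty read means the peer closed the connection.
            sel.unregister(key.fileobj)
            parser.close()
        done.extend(elem for _event, elem in parser.read_events())
    return done

# Demo wiring: a connected socket pair plays the role of the network.
writer, reader = socket.socketpair()
reader.setblocking(False)
sel = selectors.DefaultSelector()
sel.register(reader, selectors.EVENT_READ)
parser = ET.XMLPullParser(events=("end",))

received = []
received += poll_once(sel, parser)             # nothing sent yet, returns at once
writer.sendall(b"<root><item>1</item>")
received += poll_once(sel, parser, timeout=1)  # first chunk
writer.sendall(b"<item>2</item></root>")
writer.close()
received += poll_once(sel, parser, timeout=1)  # second chunk
received += poll_once(sel, parser, timeout=1)  # end of stream
sel.close()
reader.close()

item_texts = [elem.text for elem in received if elem.tag == "item"]
all_tags = [elem.tag for elem in received]
```

In a real program, poll_once would be called from the main loop with a timeout of 0, with the other work done between calls.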

Alternatively, multiprocessing is the process-based alternative to threads, but I assume you aren't running Python 2.6.

One way or the other, I think you're going to have to use threads, extra processes or weave some equally complex async magic.
