Parsing a partial XML with python lxml

Posted 2019-02-15 08:50

Question:

I'm trying to parse a large XML file which is being received from the network in Python.

To do that, I read the data and pass it to lxml.etree.iterparse.

However, if the XML has yet to fully be sent, like so:

<MyXML>
    <MyNode foo="bar">
    <MyNode foo="ba

If I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at the point where the data was cut off.

Is there any way I can make it so I can receive the first tag (i.e. the first MyNode) and only get an exception when I reach that part of the document? (To make lxml really 'stream' the contents and not read the whole thing in the beginning).

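Answer 1:

XMLPullParser and HTMLPullParser may better suit your needs. They receive their data through repeated calls to parser.feed(data), and parser.read_events() returns the events for whatever has been parsed so far. You still have to wait until all of the data has come in before the complete tree is usable.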
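
As a rough illustration of the feed-based approach, here is a minimal sketch. The helper name iter_complete_nodes and the sample chunks are hypothetical and stand in for whatever your network code delivers. The fallback to the standard library's xml.etree.ElementTree is only there so the sketch runs without lxml installed; its XMLPullParser offers the same feed() and read_events() calls.

```python
try:
    from lxml import etree  # the library discussed in the question
except ImportError:
    # Assumption: fall back to the stdlib, which has a compatible XMLPullParser
    import xml.etree.ElementTree as etree


def iter_complete_nodes(chunks, tag="MyNode"):
    """Yield every element named `tag` as soon as it has been fully received.

    `chunks` is any iterable of byte strings, e.g. data read from a socket.
    """
    parser = etree.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        # read_events() only returns events for data parsed so far
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem


# Hypothetical chunks: the stream is cut off inside the second MyNode.
chunks = [b'<MyXML>\n    <MyNode foo="bar"/>\n', b'    <MyNode foo="ba']
received = [elem.get("foo") for elem in iter_complete_nodes(chunks)]
print(received)
```

The sketch never calls parser.close(), so the truncated second node is simply left pending instead of raising. Calling close() once the stream has ended is where an incomplete document would be expected to surface as an error.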



Answer 2:

Have a look at the answers to the two related questions below; other related answers may help as well. Your problem is very common, so you may only need to adapt it slightly to fit a proven solution. That is usually a better route to a stable result than building something new.

  • using lxml and iterparse() to parse a big (+- 1Gb) XML file
  • Parsing Large XML file with Python lxml and Iterparse


Tags: python xml lxml