Are there any XML parsers for Python that can parse file streams? My XML files are too big to fit in memory, so I need to parse the stream.
Ideally I wouldn't have to have root access to install things, so lxml
is not a very good option.
I have been using xml.etree.ElementTree
but I am convinced it is broken.
Are you looking for
xml.sax
? It's right in the standard library.Here's good answer about
xml.etree.ElementTree.iterparse
practice on huge XML files.lxml
has the method as well. The key to stream parsing withiterparse
is manual clearing and removing already processed nodes, because otherwise you will end up running out of memory.Another option is using
xml.sax
. The official manual is too formal to me, and lacks examples so it needs clarification along with the question. Default parser module,xml.sax.expatreader
, implement incremental parsing interfacexml.sax.xmlreader.IncrementalParser
. That is to sayxml.sax.make_parser()
provides suitable stream parser.For instance, given a XML stream like:
Can be handled in the following way.
Use
xml.etree.cElementTree
. It's much faster thanxml.etree.ElementTree
. Neither of them are broken. Your files are broken (see my answer to your other question).