Parsing Large XML file with Python lxml and Iterparse

Posted 2019-02-19 17:51

Question:

I'm attempting to write a parser using lxml and the iterparse method to step through a very large XML file containing many items.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    print(elem.xpath('desc/text()'))
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context

When I run it, I get something similar to:

[]
['Description 1']
[]
['Description 2']

The empty lists appear because iterparse also yields the <item> tags nested under the <url> tag, and those obviously have no desc child for the xpath to extract. My hope was to parse the items one by one and then process their child fields as needed. I'm still learning the lxml library, so I'm curious whether there is a way to pull out only the top-level items while ignoring any nested sub-items.

Answer 1:

The entire XML is parsed by the underlying implementation in any case; etree.iterparse is just a generator-style view over it that provides simple filtering by tag name (see the API docs: http://lxml.de/api/lxml.etree.iterparse-class.html). If you need more complex filtering, you have to do it yourself.

One solution is to register for the start event as well:

context = etree.iterparse(MYFILE, events=("start", "end"), tag="item")

and keep a flag (or a depth counter) so that on each end event you know whether you just closed a top-level "item" or a nested "item/url/item".
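The suggestion above can be sketched as follows. This uses a depth counter rather than a plain boolean so that arbitrarily deep nesting is also handled; the sample XML is wrapped in a hypothetical root element (the snippets in the question have no single root, which iterparse would reject), and the helper name iter_top_level_items is my own, not part of lxml:

```python
from io import BytesIO
from lxml import etree

# Sample data from the question, wrapped in a root element so it is well-formed.
XML = b"""<root>
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url><item>http://www.url1.com</item></url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url><item>http://www.url2.com</item></url>
</item>
</root>"""

def iter_top_level_items(source):
    """Yield only outermost <item> elements, skipping <item>s nested inside them."""
    depth = 0  # how many <item> start tags we are currently inside
    for event, elem in etree.iterparse(source, events=("start", "end"), tag="item"):
        if event == "start":
            depth += 1
        else:  # "end"
            depth -= 1
            if depth == 0:
                # This end event closes a top-level <item>: hand it to the caller,
                # then free the memory held by the already-processed subtree.
                yield elem
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]

descs = [elem.findtext("desc") for elem in iter_top_level_items(BytesIO(XML))]
print(descs)  # -> ['Description 1', 'Description 2']
```

The nested <item> under <url> still triggers start/end events (the tag filter matches it), but the counter only drops back to zero when the outer element closes, so only the two top-level items are yielded and the empty lists from the question disappear.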