Parsing Large XML file with Python lxml and Iterpa

Posted 2019-02-19 17:42

I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

and so far my solution is:

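from lxml import etree

# MYFILE is the path to the XML file (or an open binary file object).
context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    print(elem.xpath('desc/text()'))
    # Free the element and any already-processed siblings before it.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context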

When I run it, I get something similar to:

[]
['description1']
[]
['description2']

The empty lists appear because iterparse also returns the item tags that are children of the url tag, and those obviously have no description field to extract with xpath. My hope was to parse out the items one by one and then process their child fields as required. I'm still learning the lxml library, so I'm curious whether there is a way to pull out the main items while leaving any nested items alone when they are encountered?

Tags: python xml lxml large-files iterparse

1 answer

姐就是有狂的资本

Reply #2 · 2019-02-19 18:29

The entire XML is parsed by the core implementation anyway. etree.iterparse is just a generator-style view over it that provides simple filtering by tag name (see the docstring: http://lxml.de/api/lxml.etree.iterparse-class.html). If you want more complex filtering, you have to do it yourself.

One solution is to register for the start event as well:

iterparse(self, source, events=("start", "end",), tag="item")

and keep a boolean so you know whether an end event is the end of a top-level "item" or the end of a nested "item/url/item".
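As a rough sketch of that idea (not necessarily what the answerer had in mind), the code below swaps the single boolean for a depth counter, which does the same job and also copes with deeper nesting. The `<items>` root element, the `SAMPLE` data and the helper name `iter_top_level_descriptions` are assumptions made up for illustration; the question only shows the repeated `<item>` elements.

```python
import io
from lxml import etree

# Small in-memory stand-in for the large file. The <items> root is an
# assumption: the question shows only the repeated <item> elements.
SAMPLE = b"""<items>
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>
</items>"""


def iter_top_level_descriptions(source):
    """Yield the <desc> text of each outermost <item>, skipping nested ones.

    Hypothetical helper, written only to illustrate the start/end approach.
    """
    depth = 0  # number of <item> elements currently open
    context = etree.iterparse(source, events=("start", "end"), tag="item")
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth > 0:
            # End of a nested item (item/url/item): leave it untouched so
            # the enclosing item can still read it later.
            continue
        yield elem.findtext("desc")
        # Free memory: clear this item and drop already-processed siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


descriptions = list(iter_top_level_descriptions(io.BytesIO(SAMPLE)))
print(descriptions)
```

For the real file, pass the path or an open binary file object in place of the `io.BytesIO` wrapper. Because nested items are skipped rather than cleared, their text is still available when the enclosing item's end event arrives.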
