Retrieve only a portion of an XML feed

I'm using Scrapy's XMLFeedSpider to parse a large XML feed (60 MB) from a website, and I was wondering whether there is a way to retrieve only a portion of it instead of all 60 MB, because the RAM consumption is currently quite high. Perhaps there is something I could put in the URL, like:

"http://site/feed.xml?limit=10". I've searched for something similar but haven't found anything.

Another option would be to limit the number of items parsed by Scrapy, but I don't know how to do that. Right now, once XMLFeedSpider has parsed the whole document, the bot analyzes only the first ten items, but I suppose the whole feed is still in memory. Do you have any idea how to improve the bot's performance and reduce its RAM and CPU consumption? Thanks.

Tags: python xml web-scraping scrapy

2 Answers

贪生不怕死

Reply #2 · 2019-07-23 20:26

You should set the iterator attribute of your XMLFeedSpider to iternodes. As the Scrapy documentation says:

It’s recommended to use the iternodes iterator for performance reasons

After doing so, you should be able to iterate over your feed and stop at any point.


冷血范

Reply #3 · 2019-07-23 20:32

When you are processing large XML documents and don't want to load the whole thing into memory as DOM parsers do, you need to switch to a SAX parser.

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

For a 60 MB XML document, this is likely to be very low compared to the requirements for creating a DOM. Most DOM-based systems actually use SAX at a much lower level to build up the tree.

To make use of SAX, subclass xml.sax.saxutils.XMLGenerator and override startElement, endElement and characters. Then call xml.sax.parse with an instance of it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.
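As a rough illustration of the approach, here is a small sketch using only the standard library. It subclasses xml.sax.ContentHandler (the base class of XMLGenerator) rather than XMLGenerator itself, since nothing is written back out. The item and title element names, the StopParsing exception and the first_titles helper are all hypothetical names chosen for the example. Raising an exception from the handler is used here as a simple way to abandon the rest of the document once enough items have been read.

```python
import io
import xml.sax


class StopParsing(Exception):
    """Raised by the handler to abort parsing once enough items are read."""


class FirstItemsHandler(xml.sax.ContentHandler):
    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_item = False
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._in_title:
            self._in_title = False
            self.titles.append("".join(self._buffer))
        elif name == "item":
            self._in_item = False
            if len(self.titles) >= self.limit:
                raise StopParsing  # skip the rest of the document


def first_titles(source, limit=10):
    """Return the titles of at most `limit` items from a file-like XML source."""
    handler = FirstItemsHandler(limit)
    try:
        xml.sax.parse(source, handler)
    except StopParsing:
        pass
    return handler.titles


feed = io.BytesIO(
    b"<rss><item><title>a</title></item>"
    b"<item><title>b</title></item>"
    b"<item><title>c</title></item></rss>"
)
print(first_titles(feed, limit=2))  # ['a', 'b']
```

The handler keeps only the current title buffer and the collected titles, so memory use stays small however large the feed is.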
