I'm using Scrapy's XMLFeedSpider to parse a large XML feed (60 MB) from a website, and I was wondering whether there is a way to retrieve only a portion of it instead of all 60 MB, because right now the RAM consumption is quite high. Maybe something to put in the link, like:
"http://site/feed.xml?limit=10". I've searched for something similar to this but haven't found anything.
Another option would be to limit the items parsed by Scrapy, but I don't know how to do that. Right now, once the XMLFeedSpider has parsed the whole document, the bot analyzes only the first ten items, but I suppose the whole feed is still in memory. Do you have any ideas on how to improve the bot's performance and reduce its RAM and CPU consumption? Thanks.
You should set the iterator mode of your XMLFeedSpider to
iternodes
(see the documentation). This iterator scans the nodes with a regular expression instead of building a DOM for the whole document, so it keeps memory usage low. After doing so, you should be able to iterate over your feed and stop at any point.
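A sketch of what such a spider could look like (not tested against a live feed; the URL, itertag, and field names are placeholders for whatever the real feed uses, and CloseSpider is just one way to stop early):

```python
# Hypothetical spider: start_urls, itertag and the "title" field are
# placeholders for the real feed's structure.
from scrapy.exceptions import CloseSpider
from scrapy.spiders import XMLFeedSpider

class FeedSpider(XMLFeedSpider):
    name = "feed"
    start_urls = ["http://site/feed.xml"]
    iterator = "iternodes"  # streaming, regex-based iterator (the default)
    itertag = "item"        # the tag of each node handed to parse_node

    max_items = 10          # stop after the first ten items
    seen = 0

    def parse_node(self, response, node):
        self.seen += 1
        if self.seen > self.max_items:
            # The response itself has already been downloaded, but no
            # further nodes are parsed into items once the spider closes.
            raise CloseSpider("item limit reached")
        yield {"title": node.xpath("title/text()").get()}
```

Note that the whole 60 MB response is still downloaded before iteration starts; what iternodes saves you is the cost of building a full document tree on top of it.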
When you are processing large XML documents and don't want to load the whole thing into memory the way DOM parsers do, you need to switch to a SAX parser. A SAX parser only keeps the current parsing event and whatever state you maintain yourself, so for a 60 MB XML document its memory footprint is likely to be very low compared to the requirements for creating a DOM. Most DOM-based systems actually use a SAX-style parser at a much lower level to build up the tree.
To make use of SAX, subclass
xml.sax.ContentHandler
(or
xml.sax.saxutils.XMLGenerator
, which is itself a ContentHandler) and override
startElement
,
endElement
and
characters
. Then call
xml.sax.parse
with it. I'm sorry I don't have a detailed example at hand to share with you, but I'm sure you will find plenty online.
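Here is a minimal sketch of the idea (subclassing xml.sax.ContentHandler and parsing a synthetic in-memory feed, since the real feed's structure isn't given; the item/title tag names are placeholders). Raising an exception from a handler callback is the standard way to abort a SAX parse early, so only the first ten items are ever processed:

```python
import xml.sax

class StopParsing(Exception):
    """Raised from the handler to abort the parse once we have enough items."""

class ItemLimitHandler(xml.sax.ContentHandler):
    """Collects the text of the first `limit` <title> elements, then stops."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element,
        # so accumulate the pieces.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._buffer))
            if len(self.titles) >= self.limit:
                raise StopParsing

# Synthetic 100-item feed standing in for the real 60 MB document.
feed = "<rss><channel>" + "".join(
    "<item><title>post %d</title></item>" % i for i in range(100)
) + "</channel></rss>"

handler = ItemLimitHandler(limit=10)
try:
    xml.sax.parseString(feed.encode("utf-8"), handler)
except StopParsing:
    pass  # parse aborted after the tenth title

print(handler.titles)
```

Since the parser never builds a tree, memory stays bounded by your own state (here, ten short strings) no matter how large the input is.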