I am using a Scrapy CrawlSpider and have defined a Twisted reactor to control my crawler. During testing I crawled a news site, collecting several GB of data. Since I am mostly interested in the newest stories, I am looking for a way to limit the number of requested pages, bytes, or seconds.

Is there a common way to define a limit on
- request_bytes
- request_counts or
- run time in seconds?
In Scrapy there is the class scrapy.contrib.closespider.CloseSpider. You can define the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT.

The spider closes automatically when one of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider