Scrapy: Limit the number of requests or request bytes

Posted 2019-04-07 07:10

I am using a Scrapy CrawlSpider and defined a Twisted reactor to control my crawler. During testing I crawled a news site and collected several GB of data. Since I am mostly interested in the newest stories, I am looking for a way to limit the number of requested pages, bytes, or seconds.

Is there a common way to define a limit of

  • request_bytes
  • request_counts or
  • run time in seconds?

Tags: python scrapy
1 Answer
爷、活的狠高调 · 2019-04-07 07:56

Scrapy ships with the CloseSpider extension (scrapy.contrib.closespider.CloseSpider). You can set CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT in your settings.

The spider closes automatically when any of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
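For example, a settings fragment like the following would cap the crawl by run time, page count, item count, and error count (the numeric values here are illustrative assumptions, not defaults):

```python
# settings.py -- a minimal sketch; the limit values below are
# arbitrary examples, pick numbers that suit your crawl.
CLOSESPIDER_TIMEOUT = 3600      # close the spider after 3600 seconds of run time
CLOSESPIDER_PAGECOUNT = 100     # close after 100 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 1000    # close after 1000 items have been scraped
CLOSESPIDER_ERRORCOUNT = 10     # close after 10 errors
```

Whichever limit is hit first closes the spider. Note that, as far as I know, there is no built-in CloseSpider setting for total bytes downloaded; the DOWNLOAD_MAXSIZE setting caps the size of a single response, not the crawl as a whole.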
