I am using a Scrapy CrawlSpider and have defined a Twisted reactor to control my crawler. During testing I crawled a news site, collecting several GB of data. Since I am mostly interested in the newest stories, I am looking for a way to limit the number of requested pages, bytes, or seconds.

Is there a common way to define a limit on
- request_bytes
- request_counts or
- run time in seconds?
In Scrapy there is the class scrapy.contrib.closespider.CloseSpider. You can define the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT.

The spider closes automatically when one of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider