The scrapy-redis program does not close automatically

Posted 2019-08-18 17:23

  • I am using the scrapy-redis framework; Redis stores the xxx:requests. They have all been crawled, but the program keeps running. How can I make the program stop automatically instead of running forever?

  • The log output while it runs:

2017-08-07 09:17:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 09:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

  • I use scrapy-redis to crawl a site. scrapy-redis does not shut down automatically; it keeps asking for URLs even though there are none left, so it logs scraped 0 items (at 0 items/min) forever. (A quick way to check the queue directly is sketched below.)
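A minimal way to confirm the pending queue really is empty, using redis-py directly; this is a sketch assuming a local Redis and the default scrapy-redis key layout, with the key name below a placeholder taken from the question:

import redis

# Connect to the Redis instance scrapy-redis uses (host/port are assumptions).
server = redis.Redis(host='localhost', port=6379)

# scrapy-redis keeps pending requests under '<spider>:requests'; depending on
# the configured queue class the key is a list or a sorted set, so check both.
key = 'xxx:requests'  # placeholder spider name from the question
if server.type(key) == b'list':
    pending = server.llen(key)
else:
    pending = server.zcard(key)  # zcard returns 0 if the key does not exist
print('pending requests:', pending)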

2 Answers
We Are One
#2 · 2019-08-18 17:43

Well, scrapy-redis is made to stay open, waiting for more URLs to be pushed into the Redis queue, but if you want to close it you can do it with a pipeline. Here:

class TestPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.redis_db = None
        self.redis_len = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        # Read the queue length once when the spider starts. llen assumes the
        # queue is a single Redis list; the original len(spider.server.keys(...))
        # counts matching key names, which only equals the number of queued
        # items if every item is stored under its own key.
        self.redis_len = spider.server.llen('your_redis_key')

    def process_item(self, item, spider):
        # Count down one entry per scraped item; close on the last one.
        self.redis_len -= 1
        if self.redis_len <= 0:
            self.crawler.engine.close_spider(spider, 'No more items in redis queue')

        return item

I will explain how it works: in open_spider the pipeline gets the total number of entries in the Redis queue, and in process_item it decrements the redis_len variable; when it reaches zero, it sends a close signal on the last item.
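To activate the pipeline, it has to be registered in settings.py. A minimal sketch, assuming the class lives in myproject/pipelines.py (the module path is a placeholder for your own project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.TestPipeline': 300,  # placeholder module path
}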

兄弟一词,经得起流年.
#3 · 2019-08-18 17:44

scrapy-redis will always wait for new URLs to be pushed into the Redis queue. When the queue is empty, the spider goes into an idle state and waits for new URLs. That's what I used to close my spider once the queue is empty.

When the spider is idle (when it is doing nothing), I check whether there is still something left in the Redis queue. If not, I close the spider with close_spider. The following code is located in the spider class:

# requires: import scrapy  (for scrapy.signals.spider_idle)

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    from_crawler = super(SerpSpider, cls).from_crawler
    spider = from_crawler(crawler, *args, **kwargs)
    # Call spider.idle whenever the engine reports the spider as idle.
    crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
    return spider


def idle(self):
    # self.server is the Redis connection scrapy-redis attaches to the
    # spider (the original snippet called it self.q).
    if self.server.llen(self.redis_key) <= 0:
        self.crawler.engine.close_spider(self, reason='finished')
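
For context, here is a minimal sketch of how those two methods sit inside a scrapy-redis spider; the spider name, redis_key value, and empty parse are assumptions, and server is the Redis connection scrapy-redis attaches to the spider:

import scrapy
from scrapy_redis.spiders import RedisSpider


class SerpSpider(RedisSpider):
    name = 'serp'                  # assumed spider name
    redis_key = 'serp:start_urls'  # queue this spider pops URLs from

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SerpSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Run idle() every time the engine reports the spider as idle.
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        # Close only when nothing is left to pop from the queue.
        if self.server.llen(self.redis_key) <= 0:
            self.crawler.engine.close_spider(self, reason='finished')

    def parse(self, response):
        pass  # extraction logic goes here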