Scrapy not responding to CloseSpider exception

Posted 2019-07-22 01:45

I've implemented a solution that relies on Scrapy to run multiple spiders simultaneously. Based on what I've read here (http://doc.scrapy.org/en/latest/topics/exceptions.html), in order to gracefully signal a spider that it's time to die, I should raise a CloseSpider exception as follows:

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class SomeSpider(CrawlSpider):
    def parse_items(self, response):
        # to_be_killed is a custom flag on the spider, not a Scrapy attribute
        if self.to_be_killed:
            raise CloseSpider(reason="Received kill signal")

However, while the code does raise the exception once the flag is set, the spider keeps processing requests for a long time afterwards. I need it to stop what it's doing immediately.

I realize that Scrapy is built around an asynchronous framework, but is there any way I can force the spider to shut down without generating any additional outbound requests?
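
For context, to_be_killed is not something Scrapy provides; it is just an attribute I keep on the spider and flip from the code that manages the crawl. A minimal sketch of how the flag might be wired up (the spider name and the exact initialisation are illustrative, not copied from my project):

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class SomeSpider(CrawlSpider):
    name = "some_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Custom flag; the code managing the running spiders sets it to True
        # (spider.to_be_killed = True) when this spider should stop.
        self.to_be_killed = False

    def parse_items(self, response):
        if self.to_be_killed:
            raise CloseSpider(reason="Received kill signal")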

Tags: python scrapy
1 Answer
我命由我不由天
#2 · 2019-07-22 02:27

So I ended up using a hacky solution to bypass the problem. Instead of trying to terminate the spider immediately, which doesn't play well with the Twisted framework, I wrote a downloader middleware that refuses any request coming from a spider I have flagged for closing.

So:

from scrapy.exceptions import IgnoreRequest

class SpiderStatusMiddleware:

    def process_request(self, request, spider):
        # Refuse to download anything for a spider that has been flagged for shutdown.
        if spider.to_be_killed or not spider.active:
            spider.logger.debug("Spider has been killed, ignoring request to %s", request.url)
            raise IgnoreRequest()

        return None

NOTE: to_be_killed and active are both flags that I defined in my spider class; they are managed by my own code.
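
For completeness, a downloader middleware only takes effect once it is enabled in the project settings. A minimal sketch, assuming the class above lives in myproject.middlewares (the module path and the priority number are placeholders, not taken from the answer):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SpiderStatusMiddleware": 543,
}

Once enabled, every request that reaches the downloader after the flag flips is rejected with IgnoreRequest, so the spider can drain its scheduler queue without making any further network calls.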
