I've implemented a solution that relies on Scrapy to run multiple spiders simultaneously. Based on what I've read here (http://doc.scrapy.org/en/latest/topics/exceptions.html), in order to gracefully signal a spider that it's time to die, I should raise a CloseSpider exception as follows:
from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class SomeSpider(CrawlSpider):

    def parse_items(self, response):
        if self.to_be_killed:
            raise CloseSpider(reason="Received kill signal")
However, while the exception does get raised when that check is hit, the spider keeps processing requests for a long time afterwards. I need it to stop what it's doing immediately.
I realize that Scrapy is built around an asynchronous framework, but is there any way I can force the spider to shut down without generating any additional outbound requests?
So I ended up using a hacky solution to work around the problem. Rather than actually terminating the spider on the spot, which doesn't play well with the Twisted framework, I wrote a DownloaderMiddleware that refuses any request coming from a spider that I have flagged for closing.
So:
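Roughly, it looks like the sketch below (the class name SpiderStatusMiddleware and the exact wording of the check are illustrative; the key point is raising IgnoreRequest for anything coming from a flagged spider):

from scrapy.exceptions import IgnoreRequest

class SpiderStatusMiddleware(object):

    def process_request(self, request, spider):
        # Refuse any outbound request from a spider that has been flagged
        # for shutdown; IgnoreRequest drops the request before it reaches
        # the downloader and triggers the request's errback, if any.
        if spider.to_be_killed or not spider.active:
            raise IgnoreRequest("Spider has been flagged for termination")
        # Returning None lets the request continue through the middleware
        # chain and on to the downloader as usual.
        return None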
NOTE: to_be_killed and active are both flags that I defined in my spider class and that are managed by my own code.
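For context, a hypothetical sketch of how the flags and the middleware registration might fit together (the kill() helper, the myproject.middlewares path, and the priority value 543 are placeholders for my own setup):

from scrapy.spiders import CrawlSpider

class SomeSpider(CrawlSpider):
    name = "some_spider"

    def __init__(self, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Both flags start in the "keep crawling" state.
        self.to_be_killed = False
        self.active = True

    def kill(self):
        # Called by my own coordination code when this spider should stop.
        self.to_be_killed = True
        self.active = False

and the middleware enabled in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SpiderStatusMiddleware': 543,
}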