I have some code which looks something like this:
from scrapy.crawler import CrawlerProcess


def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()  # blocks until the crawl finishes
    return True
I have two py.test tests that each call run(). When the second test executes, I get the following error:
runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
reactor.run(installSignalHandlers=False) # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>
    def startRunning(self):
        """
        Method called when reactor starts: do some initialization and fire
        startup events.

        Don't call this directly, call reactor.run() instead: it should take
        care of calling this.

        This method is somewhat misnamed. The reactor will not necessarily be
        in the running state by the time this method returns. The only
        guarantee is that it will be on its way to the running state.
        """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable
I understand that the reactor has already been started once, so I cannot call runner.start() again when the second test runs. Is there some way to reset its state in between the tests, so that they are more isolated and can actually run one after another?
If you use CrawlerRunner instead of CrawlerProcess, in conjunction with pytest-twisted, you should be able to run your tests like this.
First, install the Twisted integration for pytest:
    pip install pytest-twisted
Then write each test so that it returns the Deferred produced by the crawl:
from scrapy.crawler import CrawlerRunner


def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred


def test_scrapy_crawler():
    # MySpider and settings are your own spider class and Scrapy settings.
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
To put it plainly, _run_crawler() schedules a crawl in the Twisted reactor, and the Deferred it returns fires its callbacks when the scrape completes. Those callbacks (_success() and _error()) are where you do your assertions. Lastly, the test has to return the Deferred object it got from _run_crawler() so that the test waits until the crawl is complete. Returning the Deferred is essential and must be done in every test.
Here's an example of how to run multiple crawls and aggregate results using gatherResults.
from twisted.internet import defer


def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)

    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
I hope this helps. If it doesn't, please ask where you're struggling.
According to the Scrapy docs:
By default, Scrapy runs a single spider per process when you run
scrapy crawl. However, Scrapy supports running multiple spiders per
process using the internal API.
For example:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
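If you want to run another spider after you've called process.start(), then I expect you can just issue another process.crawl(SomeSpider) call at the point in your program where you determine the need to do this. Bear in mind that process.start() blocks until crawling finishes, so that call would have to come from code running inside the reactor, such as a callback.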
Examples of other scenarios are given in the docs.