with:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
I've always run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
but since I've moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
and started calling the method using class instantiation, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I am getting the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
What is wrong?
As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following: it starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.
The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jury-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
The mistake is in the __call__ code from the question: web_crawler() returns two results, but it is called twice (once per index), so it tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Obtaining the results by running one single process, and storing both results in a tuple, is the way to go here:
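A minimal sketch of that fix, runnable without Scrapy. `OneShotProcess` is a hypothetical stand-in for `CrawlerProcess` that only imitates the fact that `start()` can succeed once per interpreter; the class name `Test` and the result strings are placeholders too:

```python
class OneShotProcess:
    """Hypothetical stand-in for CrawlerProcess: start() works once per interpreter."""

    _already_started = False

    def crawl(self, *args):
        pass  # a real process would schedule a spider here

    def start(self):
        if OneShotProcess._already_started:
            # Imitates twisted.internet.error.ReactorNotRestartable
            raise RuntimeError("reactor not restartable (simulated)")
        OneShotProcess._already_started = True


class Test:
    def web_crawler(self):
        process = OneShotProcess()
        process.crawl()
        process.start()  # in the real thing this blocks until crawling is finished
        return ("result1", "result2")

    def __call__(self):
        # Call web_crawler() exactly once and unpack the tuple it returns
        results1, results2 = self.web_crawler()
        return results1, results2
```

Calling `Test()()` returns both results from a single run. A second `web_crawler()` call in the same interpreter would still fail, which is what the indexing version in the question ran into.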
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.
0) pip install crochet
1) from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error and spent 4+ hours solving it, reading all the questions here about it. I finally found that one, so I am sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last 2 lines in my code:
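A minimal sketch of what such a `run_spider` helper might look like, assuming crochet and Scrapy are installed. The package path `myproject.spiders`, the class name `MySpider`, the `timeout` value and the `load_spider_class` helper are hypothetical names for illustration, and `setup()` is called inside the function here rather than at the top of the file:

```python
from importlib import import_module


def load_spider_class(module_path, class_name):
    """Import module_path and return the attribute named class_name."""
    return getattr(import_module(module_path), class_name)


def run_spider(spider_name, package="myproject.spiders", class_name="MySpider",
               timeout=600.0):
    """Run one spider in crochet's background reactor and wait for it to finish.

    package, class_name and timeout are hypothetical defaults for illustration.
    """
    # Third-party imports are kept local so load_spider_class works without them.
    from crochet import setup, wait_for
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    setup()  # lets crochet run the reactor in a background thread

    spider_cls = load_spider_class("{}.{}".format(package, spider_name), class_name)

    @wait_for(timeout=timeout)
    def _crawl():
        # Runs in the reactor thread; no reactor.run() or reactor.stop() here.
        runner = CrawlerRunner(get_project_settings())
        return runner.crawl(spider_cls)  # the last two lines come from the Scrapy docs

    _crawl()
```

Because the reactor is never stopped, `run_spider` can be called again with another spider name. Depending on the Scrapy version, extra reactor configuration may be needed.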
This code allows me to select which spider to run just by passing its name to the run_spider function and, after scraping finishes, to select another spider and run it again.
Hope this will help somebody, as it helped me :)
As some people pointed out already: You shouldn't need to restart the reactor.
Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks.
For example, I've been using a loop spider that follows this pattern: run a crawl, wait, then go back to the first step.
And this is how it looks in Scrapy:
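A minimal sketch of such a loop, assuming the classic `CrawlerRunner` API whose `crawl()` returns a Twisted Deferred. The function names and the `spider_cls` parameter are illustrative, and the Scrapy and Twisted imports are kept inside `loop_crawl` so the two callback helpers can be read (and exercised) on their own:

```python
import time


def sleep(_, duration=5):
    """Callback: block for `duration` seconds once a crawl is over."""
    print("sleeping for: {}".format(duration))
    time.sleep(duration)  # blocks the reactor on purpose


def crawl(runner, spider_cls):
    """Schedule one crawl, then a sleep, then the next crawl."""
    d = runner.crawl(spider_cls)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner, spider_cls))  # recursive: schedule another crawl
    return d


def loop_crawl(spider_cls):
    """Start the endless crawl/sleep loop; the reactor is started exactly once."""
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor

    runner = CrawlerRunner(get_project_settings())
    crawl(runner, spider_cls)
    reactor.run()
```

The key point is that `reactor.run()` appears once, and every later crawl is scheduled from a callback of the previous one.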
To explain the process more: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (which schedules another crawl).
This solved my problem: put the code below after reactor.run() or process.start():
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:
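A minimal sketch of the forking approach using only the standard library. `fake_crawl` is a hypothetical stand-in for the function that would create the runner and call `reactor.run()`; the `"fork"` start method is an assumption that holds on POSIX systems only:

```python
import multiprocessing as mp


def fake_crawl(name):
    """Hypothetical stand-in for code that starts the reactor and runs a crawl."""
    return "crawled " + name


def _child(queue, func, args):
    # Runs in the child process: report either the result or the exception.
    try:
        queue.put((True, func(*args)))
    except Exception as exc:
        queue.put((False, exc))


def run_in_fresh_process(func, *args):
    """Run func(*args) in a forked child, so each call gets its own reactor state."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue, func, args))
    proc.start()
    ok, payload = queue.get()  # read before join() so the child is never left blocked
    proc.join()
    if not ok:
        raise payload  # re-raise the child's exception in the parent
    return payload
```

Since every call runs in a new process, the parent never starts a reactor itself, and `run_in_fresh_process(fake_crawl, ...)` can be called as many times as needed. In a real script, the reactor should only be imported and run inside the function passed in, not in the parent.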
Run it twice:
Result: