Scrapy - Reactor not Restartable [duplicate]

with:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I've always ran this process sucessfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()

but since I've moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2)

and started calling the method using class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I am getting the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

what is wrong?

标签： python scrapy web-crawler

6条回答

做个烂人

2楼-- · 2020-01-25 06:46

As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.

Honestly, if you think you need to restart the reactor, you're likely doing something wrong.

Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation, too.

0人赞添加讨论(0) 举报

叼着烟拽天下

3楼-- · 2020-01-25 07:00

The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

web_crawler() returns two results, and for that purpose it is trying to start the process twice, restarting the Reactor, as pointed by @Rejected.

obtaining results running one single process, and storing both results in a tuple, is the way to go here:

def __call__(self):
    result1, result2 = test.web_crawler()

0人赞添加讨论(0) 举报

Lonely孤独者°

4楼-- · 2020-01-25 07:02

This is what helped for me to win the battle against ReactorNotRestartable error: last answer from the author of the question
0) pip install crochet
1) import from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()

I had the same problem with this error, and spend 4+ hours to solve this problem, read all questions here about it. Finally found that one - and share it. That is how i solved this. The only meaningful lines from Scrapy docs left are 2 last lines in this my code:

#some more imports
from crochet import setup
setup()

def run_spider(spiderName):
    module_name="first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   #do some dynamic import of selected spider   
    spiderObj=scrapy_var.mySpider()           #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)                          #from Scrapy docs

This code allows me to select what spider to run just with its name passed to run_spider function and after scrapping finishes - select another spider and run it again.
Hope this will help somebody, as it helped for me :)

0人赞添加讨论(0) 举报

三岁会撩人

5楼-- · 2020-01-25 07:04

As some people pointed out already: You shouldn't need to restart the reactor.

Ideally if you want to chain your processes (crawl1 then crawl2 then crawl3) you simply add callbacks.

For example, I've been using this loop spider that follows this pattern:

1. Crawl A
2. Sleep N
3. goto 1

And this is how it looks in scrapy:

import time

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()

To explain the process more the crawl function schedules a crawl and adds two extra callbacks that are being called when crawling is over: blocking sleep and recursive call to itself (schedule another crawl).

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5

0人赞添加讨论(0) 举报

三岁会撩人

6楼-- · 2020-01-25 07:05

This solved my problem,put below code after reactor.run() or process.start():

time.sleep(0.5)

os.execl(sys.executable, sys.executable, *sys.argv)

0人赞添加讨论(0) 举报

可以哭但决不认输i

7楼-- · 2020-01-25 07:11

You cannot restart the reactor, but you should be able to run it more times by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

0人赞添加讨论(0) 举报

Scrapy - Reactor not Restartable [duplicate]

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间