
Question:

I have deployed a Scrapy project that crawls whenever a Lambda API request comes in.

It runs perfectly for the first API call, but later calls fail with a ReactorNotRestartable error.

As far as I can tell, AWS Lambda does not kill the process between invocations, so the reactor is still present in memory.

The Lambda log error is as follows:

Traceback (most recent call last):
  File "/var/task/aws-lambda.py", line 42, in run_company_details_scrapy
    process.start()
  File "./lib/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "./lib/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "./lib/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "./lib/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
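
The failure mode can be reproduced in miniature without Scrapy. The sketch below is only a toy model: `ToyReactor` and its `_started_before` flag are hypothetical stand-ins, not Twisted's actual classes. It assumes, as the question suggests, that module-level state survives between invocations in the same Lambda process.

```python
class ReactorNotRestartable(Exception):
    """Toy stand-in for twisted.internet.error.ReactorNotRestartable."""


class ToyReactor:
    """Hypothetical one-shot event loop: it may be run only once per process."""

    def __init__(self):
        self._started_before = False

    def run(self):
        # Refuse a second run, which is the behaviour the traceback shows.
        if self._started_before:
            raise ReactorNotRestartable()
        self._started_before = True


# Module-level object: created once per process, not once per invocation.
reactor = ToyReactor()


def handler(event, context):
    reactor.run()
    return "crawled"
```

Calling `handler` once succeeds; calling it again in the same process raises `ReactorNotRestartable`, because the module-level `reactor` remembers that it has already run.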

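The Lambda handler function is:

from scrapy.crawler import CrawlerProcess

# CompanyDetailsSpidySpider is the project's spider class, imported elsewhere.
def run_company_details_scrapy(event, context):
    process = CrawlerProcess()
    process.crawl(CompanyDetailsSpidySpider)
    process.start()  # blocks until the crawl finishes, then stops the reactor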

I found a workaround that avoids stopping the reactor, by passing a flag to start():

process.start(stop_after_crawl=False)

But the problem with this is that every call then runs until the Lambda times out.

I have tried other solutions, but none of them seem to work. Can anyone guide me on how to solve this problem?

Answer 1:

You could try using Crochet (https://pypi.python.org/pypi/crochet) to coordinate use of a reactor running in a non-main thread from the Lambda-controlled main thread.

Crochet will do the threaded reactor initialization for you and provides tools to make it easy to call code in the reactor thread from the main thread (and get the results back).

This might be more in line with the expectations Lambda has of your code.

Answer 2:

This problem isn't unique to AWS Lambda - see running a spider in a Celery task.

You might try ScrapyScript (disclosure: I wrote it). It spawns a subprocess to support the Twisted reactor, blocks until all of the supplied spiders have finished, and then exits. It was written with Celery in mind, but the use case is similar.

In your case, this should work:

from scrapyscript import Job, Processor

def run_company_details_scrapy(event, context):
    job = Job(CompanyDetailsSpidySpider())
    Processor().run(job)
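
If adding a dependency is not an option, the same subprocess idea can be sketched with the standard library alone. This is a minimal sketch, not ScrapyScript's actual implementation: `run_in_child` and `crawl` are hypothetical helper names, and the code assumes a Linux environment where the "fork" start method is available. It uses `Process` and `Pipe` rather than `multiprocessing.Queue` or `Pool`, which are commonly reported not to work on Lambda.

```python
import multiprocessing as mp


def run_in_child(target, *args):
    """Run target(*args) in a fresh child process and return its result."""
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    parent_conn, child_conn = ctx.Pipe(duplex=False)  # (read end, write end)

    def _worker():
        try:
            child_conn.send(("ok", target(*args)))
        except Exception as exc:
            child_conn.send(("error", repr(exc)))
        finally:
            child_conn.close()

    proc = ctx.Process(target=_worker)
    proc.start()
    child_conn.close()  # the parent only reads
    try:
        # Read before join() so a large result cannot block the child.
        status, value = parent_conn.recv()
    except EOFError:
        status, value = "error", "child exited without reporting a result"
    proc.join()
    if status == "error":
        raise RuntimeError(value)
    return value


def crawl(spider_cls):
    """Run one spider; each call happens in a new process with a new reactor."""
    from scrapy.crawler import CrawlerProcess  # imported lazily, in the child

    process = CrawlerProcess()
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes
    return "done"
```

With these helpers, the handler body would reduce to `return run_in_child(crawl, CompanyDetailsSpidySpider)`. Because the reactor lives and dies in the child, the long-lived Lambda process never starts it at all.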

Answer 3:

Had the same problem recently, and Crochet didn't want to work for various reasons.

Eventually we went for the dirty solution: just call sys.exit(0) (or sys.exit(1) if an error was caught, not that anything looks at the return code AFAICT) at the end of the lambda handler function. This worked perfectly.

Obviously no good if you're intending to return a response from your Lambda, but if you're using Scrapy, data's probably being persisted already via your Pipelines, with a scheduler as the trigger for your Lambda, so no response needed.

Note: you will get a notice from AWS in CloudWatch:

RequestId: xxxx Process exited before completing request
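
A minimal sketch of this approach, assuming the crawl itself is wrapped in a zero-argument callable; `exit_after` and `run_crawl` are hypothetical names introduced here for illustration, not part of Scrapy or Lambda:

```python
import sys


def exit_after(run_crawl):
    """Wrap a crawl callable in a handler that always ends the process."""

    def handler(event, context):
        try:
            run_crawl()
        except Exception:
            sys.exit(1)  # the crawl failed
        sys.exit(0)  # the crawl succeeded; exit so the reactor is never reused

    return handler
```

The handler never returns a value: it always raises `SystemExit`, so the next invocation starts from a fresh process.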
