Calling Scrapy from a Python script not creating JSON file

Question:

Here's the Python script that I am using to call Scrapy, taken from the answer to

Scrapy crawl from script always blocks script execution after scraping

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# import MySpider from your project here

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

Here's my pipelines.py code:

from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher


class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        # open the output file and start the JSON exporter when the spider starts
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        # finish the export and close the file when the spider finishes
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        # write each scraped item to the JSON file
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item

This code is taken from here:

Scrapy :: Issues with JSON export

When I run the crawler like this:

scrapy crawl MySpider -a start_url='abc'

a links file with the expected output is created. But when I execute the Python script, no file is created, even though the crawler runs: the dumped Scrapy stats are similar to those of the previous run. I think there is a mistake in the Python script, since the file is created with the first approach. How do I get the script to output the file?
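
A likely reason for the difference: scrapy crawl loads the project's settings.py, where the pipeline is enabled via ITEM_PIPELINES, while the script above constructs a bare Settings() object, so the pipeline is never activated and nothing gets exported. A minimal sketch of the relevant settings.py entry for this older Scrapy version, with the project package name scrapermar11 assumed from the pipeline class name:

# settings.py: loaded by "scrapy crawl", but not by Crawler(Settings()) in the script above
# the package name "scrapermar11" is an assumption for illustration; adjust it to your project
ITEM_PIPELINES = ['scrapermar11.pipelines.scrapermar11Pipeline']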

Answer 1:

This code worked for me:

from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
# import your spider here

def handleSpiderIdle(spider):
    # stop the reactor once the spider goes idle (nothing left to crawl)
    reactor.stop()

mySettings = {
    'LOG_ENABLED': True,
    # the pipeline has to be enabled explicitly, since the project's settings.py is not loaded here
    'ITEM_PIPELINES': ['<name of your project>.pipelines.scrapermar11Pipeline'],
}
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider ourselves
crawlerProcess.crawl(spider)  # add it to the spiders pool

# use this if you need to handle the idle event (restart the spider?)
dispatcher.connect(handleSpiderIdle, signals.spider_idle)

log.start()  # depends on LOG_ENABLED
print("Starting crawler.")
crawlerProcess.start()
print("Crawler stopped.")
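
Note that the snippet above relies on APIs that were removed in later Scrapy releases (scrapy.conf, settings.overrides, scrapy.xlib.pydispatch, CrawlerProcess.install()/configure()). On current Scrapy versions the same idea, running the spider from a script with the pipeline enabled explicitly in the settings you pass in, can be sketched roughly like this; the project path placeholder and the MySpider import are assumptions, not part of the original answer:

from scrapy.crawler import CrawlerProcess
# import MySpider from your project here

process = CrawlerProcess(settings={
    'LOG_ENABLED': True,
    # recent Scrapy expects ITEM_PIPELINES as a dict of {dotted path: priority}
    'ITEM_PIPELINES': {'<name of your project>.pipelines.scrapermar11Pipeline': 300},
})

process.crawl(MySpider, start_url='abc')  # keyword arguments are passed to the spider's __init__
process.start()  # blocks here until the crawl finishes and the reactor stops

The pipeline itself would also need small updates on recent versions (JsonItemExporter now lives in scrapy.exporters, and scrapy.xlib.pydispatch is gone), but the overall structure stays the same.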


Answer 2:

A solution that worked for me was to ditch the run script and the internal API, and instead use the command line and GNU Parallel to parallelize.

To run all known spiders, one per core:

scrapy list | parallel --line-buffer scrapy crawl

scrapy list lists all spiders, one per line, which lets us pipe them as arguments to append to the command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes is printed to stdout mixed, but on a line-by-line basis rather than with quarter/half lines garbled together (for other options look at --group and --ungroup).

NB: obviously this works best on machines with multiple CPU cores, since by default GNU Parallel runs one job per core. Note that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers only have one virtual CPU core, so if you want to run multiple jobs simultaneously on one core you will have to play with the --jobs argument to GNU Parallel, e.g. to run two Scrapy crawlers per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
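
To tie this back to the original goal of getting a JSON file, the same one-liner can be combined with Scrapy's -o feed-export option so that each spider writes its items to its own file ({} is GNU Parallel's placeholder for the input line); a sketch, assuming an output/ directory already exists:

scrapy list | parallel --line-buffer "scrapy crawl {} -o output/{}.json"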