Here's the Python script that I am using to call Scrapy, taken from the answer to "Scrapy crawl from script always blocks script execution after scraping":
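
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider must also be imported from the project's spiders module
# (its import path is not shown here).

def stop_reactor():
    reactor.stop()

# Stop the Twisted reactor once the spider has finished.
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
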
Here's my pipelines.py code:

from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):
    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item

This code is taken from "Scrapy :: Issues with JSON export".
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file with the expected output is created. But when I execute the Python script, it does not create any file, even though the crawler runs: the dumped Scrapy stats are similar to those of the previous run. I think there's a mistake in the Python script, since the file is created with the first approach. How do I get the script to output the file?
A solution that worked for me was to ditch the run script and the internal API altogether, and to use the command line and GNU Parallel to parallelize instead.
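
For cases where GNU Parallel is not installed, the underlying idea (one operating-system process per spider, started through the scrapy command line rather than the internal API) can be roughly sketched with only the Python standard library. This is an illustration under assumptions, not the method used in this answer: the helper names build_crawl_command, list_spiders and run_spiders and the runner parameter are hypothetical, and the sketch assumes it is run from inside a Scrapy project directory so that `scrapy list` can find the spiders.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_crawl_command(spider_name, extra_args=()):
    """Return the argv list that crawls one spider via the scrapy CLI."""
    return ["scrapy", "crawl", spider_name, *extra_args]


def list_spiders(runner=subprocess.run):
    """Return the spider names printed by `scrapy list`, one per line."""
    result = runner(["scrapy", "list"], capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def run_spiders(spider_names, jobs=None, runner=subprocess.run):
    """Run every spider in its own process, at most `jobs` at a time.

    `jobs` defaults to the number of CPU cores reported by the OS.
    Returns a dict mapping each spider name to its process exit code.
    """
    jobs = jobs or os.cpu_count() or 1
    # Threads are only used to wait on the child processes; each crawl
    # runs in a separate scrapy process with its own Twisted reactor.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {
            name: pool.submit(runner, build_crawl_command(name))
            for name in spider_names
        }
        return {name: future.result().returncode for name, future in futures.items()}
```

Calling run_spiders(list_spiders()) would then crawl every spider in the project, and passing jobs=2 * os.cpu_count() would allow two crawlers per core. Because the child processes share the parent's stdout, their log lines may interleave; this sketch makes no attempt to buffer output per line.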
To run all known spiders, one per core:

scrapy list | parallel --line-buffer scrapy crawl

scrapy list lists all spiders, one per line, which allows us to pipe them as arguments to append to a command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes will be printed to stdout mixed, but on a line-by-line basis rather than as quarter or half lines garbled together (for other options, look at --group and --ungroup).

NB: obviously this works best on machines that have multiple CPU cores, as by default GNU Parallel will run one job per core. Note that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers only have one virtual CPU core. Therefore, if you wish to run jobs simultaneously on one core, you will have to play with the --jobs argument to GNU Parallel, e.g. to run 2 Scrapy crawlers per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl

This code worked for me: