Here's the Python script that I am using to call Scrapy, taken from the answer to
Scrapy crawl from script always blocks script execution after scraping:
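from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider must also be imported from the project's spiders module

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')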
Here's my pipelines.py code:
from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):
    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
This code is taken from
Scrapy :: Issues with JSON export.
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file with the expected output is created. But when I execute the Python script, it does not create any file, even though the crawler runs: the dumped Scrapy stats are similar to those of the previous run.
I think there's a mistake in the Python script, since the file is created with the first approach. How do I get the script to output the file?
This code worked for me:
from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

# Enable the pipeline explicitly so that it runs when started from a script
mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider ourselves
crawlerProcess.crawl(spider)  # add it to spiders pool
dispatcher.connect(handleSpiderIdle, signals.spider_idle)  # use this if you need to handle idle event (restart spider?)

log.start()  # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
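The important part above is that ITEM_PIPELINES is set explicitly: the script in the question builds its crawler from a bare Settings() object, which only carries Scrapy's defaults, so the project's pipeline is never enabled. For readers on a newer Scrapy release (1.0 or later, which is an assumption; the code above targets an older API), a minimal sketch of the same idea might look like the following. The run_spider helper name is hypothetical, and the script is assumed to be started from inside the project directory so that settings.py can be found.

```python
def run_spider(spider_cls, **spider_kwargs):
    """Run one spider to completion using the project's own settings."""
    # Imported inside the function so this module can be imported
    # even on a machine where Scrapy is not installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() loads the project's settings.py, including
    # ITEM_PIPELINES; a bare Settings() object has only Scrapy's defaults.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls, **spider_kwargs)
    process.start()  # blocks until the crawl finishes, then stops the reactor


# Usage from the project directory (the import path is an assumption):
# from myproject.spiders.my_spider import MySpider
# run_spider(MySpider, start_url='abc')
```

Keyword arguments passed to run_spider are forwarded to the spider, in the same way as -a start_url='abc' on the command line.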
A solution that worked for me was to ditch the run script and the internal API altogether, and to use the command line with GNU Parallel to parallelize instead.
To run all known spiders, one per core:
scrapy list | parallel --line-buffer scrapy crawl
scrapy list lists all spiders, one per line, which allows us to pipe them as arguments to be appended to the command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received back from the processes is printed to stdout interleaved, but on a line-by-line basis rather than with quarter or half lines garbled together (for other options, look at --group and --ungroup).
NB: obviously this works best on machines that have multiple CPU cores, as by default GNU Parallel runs one job per core. Note that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers only have one virtual CPU core. Therefore, if you wish to run jobs simultaneously on one core, you will have to play with the --jobs argument to GNU Parallel, e.g. to run 2 Scrapy crawlers per core:
scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
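If GNU Parallel is not available, roughly the same "one scrapy crawl process per spider" approach can be sketched in plain Python with the standard library. This is only an approximation under stated assumptions: the function names (run_one, run_all, list_spiders) and the jobs_per_core parameter are made up for illustration, and unlike --line-buffer this sketch does nothing to keep the children's output from interleaving mid-line.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command_prefix, name):
    """Run the command with `name` appended; return (name, exit code)."""
    result = subprocess.run(list(command_prefix) + [name])
    return name, result.returncode


def run_all(names, command_prefix=("scrapy", "crawl"), jobs_per_core=1):
    """Run the command once per name, at most jobs_per_core * cores at a time."""
    cores = os.cpu_count() or 1
    max_workers = max(1, cores * jobs_per_core)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The threads only wait on child processes, so each crawl still
        # runs in its own OS process with its own Twisted reactor.
        results = pool.map(lambda name: run_one(command_prefix, name), names)
        return dict(results)


def list_spiders():
    """Return the spider names printed by `scrapy list`, one per line."""
    output = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in output.splitlines() if line.strip()]


# Example, run from inside a Scrapy project: two crawlers per core,
# comparable to `parallel --jobs 200%`.
# exit_codes = run_all(list_spiders(), jobs_per_core=2)
```

Because every spider gets a fresh process, this sidesteps the problem of stopping and restarting the reactor inside a single script, at the cost of some start-up overhead per spider.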