I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue


class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)


# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items
# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date : Oct 24, 2010
Thank you.
Though I haven't tried it, I think the answer can be found in the Scrapy documentation, in the "Run Scrapy from a script" section under Common Practices.
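A minimal sketch of the pattern that section describes (not a verbatim quote from the docs), assuming you run it from inside a Scrapy project that defines a spider named 'myspider' -- the name is a placeholder:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Picks up the project's settings.py, so pipelines, feed exports, etc. apply.
process = CrawlerProcess(get_project_settings())

# 'myspider' is a placeholder: the name attribute of one of your spiders.
process.crawl('myspider')
process.start()  # blocks here until the crawl is finished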
From what I gather this is a new development in the library which renders some of the earlier approaches online (such as that in the question) obsolete.
All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:
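The documented example looks roughly like this (MySpider is a placeholder for your own spider definition):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition goes here
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url}

# Settings can be passed as a plain dict when you are not inside a project.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; my-crawler)',
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawl is finished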
In Scrapy 0.19.x you should do this:
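The pattern from the 0.19.x-era docs looked roughly like this (myproject and MySpider are placeholders for your own project and spider):

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

# Placeholder import: substitute your own project's spider class.
from myproject.spiders.my_spider import MySpider

spider = MySpider()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # blocks until the spider_closed signal stops the reactor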
Note these lines in particular (from the sketch above):
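settings = get_project_settings()
crawler = Crawler(settings)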
Without them your spider won't use your settings and will not save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
One more way to do it is to call the command directly from your script:
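For example (abc_spider is the spider name used elsewhere in this thread):

from scrapy import cmdline

# Equivalent to running "scrapy crawl abc_spider" on the command line;
# run this from inside the Scrapy project directory.
cmdline.execute("scrapy crawl abc_spider".split())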
Copied this from my first answer here: https://stackoverflow.com/a/19060485/1402286
When multiple crawlers need to be run inside one Python script, stopping the reactor needs to be handled with care, because the reactor can only be stopped once and cannot be restarted.
However, while working on my project I found that simply shelling out to the scrapy crawl command (for example via os.system) is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.
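A minimal sketch of that approach (abc_spider is the spider name used elsewhere in this thread):

import os

# Runs the spider exactly as if "scrapy crawl abc_spider" were typed on the
# command line; run this from inside the Scrapy project directory.
os.system('scrapy crawl abc_spider')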
If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:
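(A sketch of that idea, again shelling out to scrapy crawl in each worker process; the spider names are placeholders:)

from multiprocessing import Process
import os

def crawl(spider_name):
    # Each worker runs "scrapy crawl <spider_name>" in its own process,
    # so there is no shared Twisted reactor to worry about.
    os.system('scrapy crawl %s' % spider_name)

if __name__ == '__main__':
    spider_names = ['abc_spider', 'xyz_spider']  # placeholders for your spiders
    processes = [Process(target=crawl, args=(name,)) for name in spider_names]
    for p in processes:
        p.start()
    for p in processes:
        p.join()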
Put this code in a path from which you can run

scrapy crawl abc_spider

on the command line. (Tested with Scrapy==0.24.6.)

If you just want to run a simple crawl, it's easy: just run the command:
scrapy crawl <spider_name>

There is also an option to export your results in formats such as JSON, XML, or CSV:

scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)

You may want to try it.