I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool, scrapy, to execute its commands, i.e. the tool was not written with the intention of being called from an external program.
The user Mikhail Korobov came up with a nice solution, namely to call Scrapy from a Django custom management command. For convenience, I repeat his solution here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py
from __future__ import absolute_import
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def run_from_argv(self, argv):
        # Keep the raw argument list so it can be handed to Scrapy later.
        self._argv = argv
        return super(Command, self).run_from_argv(argv)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        # Drop "manage.py" so that Scrapy sees "scrapy <command> ...".
        execute(self._argv[1:])
Instead of calling e.g. scrapy crawl domain.com, I can now do python manage.py scrapy crawl domain.com from within a Django project. However, the options of a Scrapy command are not parsed at all. If I do python manage.py scrapy crawl domain.com -o scraped_data.json -t json, I only get the following response:
Usage: manage.py scrapy [options]
manage.py: error: no such option: -o
So my question is: how can I extend the custom management command to accept Scrapy's command line options?
Unfortunately, Django's documentation of this part is not very extensive. I've also read the documentation of Python's optparse module, but it did not make things any clearer to me. Can anyone help me in this respect? Thanks a lot in advance!
Okay, I have found a solution to my problem. It's a bit ugly but it works. Since the Django project's manage.py command does not accept Scrapy's command line options, I split the options string into two arguments which are accepted by manage.py. After successful parsing, I rejoin the two arguments and pass them to Scrapy.
That is, instead of writing
I put spaces in between the options like this
My handle function looks like this:
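A minimal sketch of the rejoining step described above, not the original handle function. It assumes each short option was typed with a space after the dash (so "-o" arrives as the two arguments "-" and "o"); the helper name rejoin_split_options is made up for this illustration, and long options are not handled.

```python
def rejoin_split_options(argv):
    """Merge each lone "-" with the argument that follows it.

    Hypothetical helper: turns ["-", "o", "out.json"] back into
    ["-o", "out.json"] so the list can be passed on to Scrapy.
    """
    merged = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-" and i + 1 < len(argv):
            # A lone dash followed by something: glue the two together.
            merged.append(arg + argv[i + 1])
            i += 2
        else:
            merged.append(arg)
            i += 1
    return merged


if __name__ == "__main__":
    split_args = ["scrapy", "crawl", "domain.com",
                  "-", "o", "scraped_data.json", "-", "t", "json"]
    print(rejoin_split_options(split_args))
```

Inside handle, the result of such a helper applied to self._argv[1:] would be what gets passed to scrapy.cmdline.execute.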
Meanwhile, Mikhail Korobov has provided the optimal solution. See here:
I think you're really looking for Guideline 10 of the POSIX argument syntax conventions:
Python's optparse module behaves this way, even under Windows.
I put the Scrapy project settings module in the argument list, so I can create separate Scrapy projects in independent apps:
Invoked as follows:
Tested with Scrapy 0.12 and Django 1.3.1.
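To illustrate the optparse behaviour mentioned above, here is a small standalone sketch. It does not involve Django or Scrapy; the parser and its -v option are assumptions made up for the demonstration. It shows that optparse stops option processing at a bare "--" and returns everything after it as positional arguments, even arguments that start with a dash.

```python
from optparse import OptionParser


def parse(argv):
    """Parse argv with a parser that only knows -v/--verbose."""
    parser = OptionParser()
    parser.add_option("-v", "--verbose", action="store_true", default=False)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Without the "--", optparse would reject "-o" as an unknown option.
    options, args = parse(
        ["-v", "--", "crawl", "domain.com", "-o", "scraped_data.json"]
    )
    print(options.verbose)
    print(args)
```

The "--" itself is consumed by the parser, so the remaining list can be handed on unchanged.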