I am trying to programmatically call a spider from a script, but I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official Scrapy site (last code snippet at the official Scrapy quotes example spider).
from scrapy import Spider, Request

class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Here is the script through which I try to run the quotes spider:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    proc = CrawlerProcess(get_project_settings())
    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()

if __name__ == '__main__':
    main()
It seems you want to have a custom log for each spider; you need to activate the logging accordingly. Scrapy Settings are a bit like Python dicts, so you can update the settings object before passing it to CrawlerProcess:
Edit following OP's comments: here's a variation using CrawlerRunner, with a new CrawlerRunner for each crawl, re-configuring logging at each iteration so that each crawl writes to a different file.

I think you can't override the custom_settings variable of a Spider class when calling it as a script, basically because the settings are loaded before the spider is instantiated.

Now, I don't really see the point of changing the custom_settings variable specifically, as it is only a way to override your default settings, and that's exactly what CrawlerProcess offers too; this works as expected.

You can override a setting from the command line:
https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options
For example:
scrapy crawl myspider -s LOG_FILE=scrapy.log