How to pass custom settings through CrawlerProcess

Published 2019-05-18 11:47

I have two CrawlerProcesses, each calling a different spider. I want to pass custom settings to one of these processes so that the spider's output is saved to CSV. I thought I could do this:

storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
process = CrawlerProcess(get_project_settings())
process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings )
process.start() 

and in my spider I read them as arguments:

    def __init__(self, crawl_links=None, allowed_domains=None, custom_settings=None, *args, **kwargs):
        self.start_urls = crawl_links
        self.allowed_domains = allowed_domains
        self.custom_settings = custom_settings
        self.rules = ......
        super(mySpider, self).__init__(*args, **kwargs)

but how can I tell my project settings file "settings.py" about these custom settings? I don't want to hard code them, rather I want them to be read automatically.

2 Answers
劳资没心,怎么记你
#2 · 2019-05-18 12:12

Do not pass settings to the crawl() method. Instead, pass your spider class (not an instance) as the first argument to crawl().

from my_crawler.spiders.my_scraper import MySpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# Pass the spider class itself; CrawlerProcess instantiates it for you.
process.crawl(MySpider, crawl_links=main_links)

process.start()
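As an aside, crawl() forwards any extra keyword arguments to the spider's __init__, which is why per-spider arguments like crawl_links go there while settings belong to the CrawlerProcess itself. A plain-Python mock of that forwarding (FakeSpider and this crawl() are made up for illustration; they are not Scrapy code):

```python
# Illustrative mock only -- not Scrapy. It shows how
# crawl(spider_cls, **kwargs) hands keyword arguments
# to the spider class's __init__.
class FakeSpider:
    def __init__(self, crawl_links=None, **kwargs):
        self.crawl_links = crawl_links

def crawl(spider_cls, *args, **kwargs):
    # CrawlerProcess.crawl() instantiates the class for you
    return spider_cls(*args, **kwargs)

spider = crawl(FakeSpider, crawl_links=['https://example.com'])
print(spider.crawl_links)
```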
Juvenile、少年°
#3 · 2019-05-18 12:17

You cannot tell your settings file about these settings, and you may be confusing crawler settings with spider settings. In Scrapy, the feed parameters (as of the time of this writing) need to be passed to the crawler process, not to the spider. I have the same use case as you: read the current project settings and then override them for each crawler process. Please see the example code below:

s = get_project_settings()
s['FEED_FORMAT'] = 'csv'
s['LOG_LEVEL'] = 'INFO'
s['FEED_URI'] = 'Q1.csv'
s['LOG_FILE'] = 'Q1.log'

proc = CrawlerProcess(s)

Also, your call to process.crawl() is not correct. The spider name should be passed as the first argument as a string, like this: process.crawl('MySpider', crawl_links=main_links), where 'MySpider' is the value given to the name attribute in your spider class.
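The override pattern above boils down to "copy the project defaults, then set per-run keys on the copy". A plain-dict sketch of the same idea (the setting names come from the answer; project_defaults and the helper settings_for_run are hypothetical, not part of Scrapy):

```python
# Plain-Python sketch of the per-process override pattern.
# project_defaults stands in for get_project_settings().
project_defaults = {'FEED_FORMAT': 'jsonlines', 'LOG_LEVEL': 'DEBUG'}

def settings_for_run(feed_uri, log_file, **overrides):
    """Copy the shared defaults and layer per-run values on top."""
    s = dict(project_defaults)   # copy, so the defaults stay untouched
    s['FEED_URI'] = feed_uri
    s['LOG_FILE'] = log_file
    s.update(overrides)
    return s

q1 = settings_for_run('Q1.csv', 'Q1.log', FEED_FORMAT='csv', LOG_LEVEL='INFO')
print(q1['FEED_FORMAT'])
```

Each CrawlerProcess then receives its own settings copy, so different runs can write different CSVs without touching settings.py.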
