Python Scrapy: passing properties into parser

Posted 2019-09-02 02:48

I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes.

I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then store the category on the resulting Items when the parser processes the response for each URL.

I guess I could have a separate spider for each category and just hold the category as an attribute on the class, so the parser can pick it up from there. But I was hoping to have just one spider for all the URLs and tell the parser which category to use for a given URL.

Right now, I'm setting up the URLs in start_urls via my spider's __init__() method. How do I pass the category for a given URL from my __init__() method to the parser, so that I can record the category on the Items generated from the responses for that URL?
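For reference, here's roughly what my current setup looks like (class and URL names are just placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URLs dynamically; each one should eventually
        # carry a category that the parser can attach to its Items.
        self.start_urls = [
            'http://example.com/a',  # Category 1
            'http://example.com/b',  # Category 1
            'http://example.com/d',  # Category 2
        ]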

Tags: python scrapy
1 Answer
爷的心禁止访问
#2 · 2019-09-02 03:25

As paul t. suggested:

from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        # Attach the category to each request through its meta dict.
        yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
        yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
        ...

    def parse(self, response):
        # The category set in start_requests is available here.
        category = response.meta['category']
        ...

Using start_requests() gives you control over the first URLs the spider visits and lets you attach metadata to each request; you can then access that metadata through response.meta in the callback.
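To record the category on your Items, include it in whatever parse yields. A minimal sketch inside the spider above, assuming a plain dict item and an illustrative title field:

    def parse(self, response):
        category = response.meta['category']
        # Store the category alongside the scraped data.
        yield {
            'title': response.css('title::text').get(),
            'category': category,
        }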

The same approach works whenever you need to pass data from one callback to another, for instance from parse to a parse_item method.
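A minimal sketch of that chaining, assuming a listing page whose item links match an illustrative a.item selector:

    def parse(self, response):
        category = response.meta['category']
        # Forward the category to the next callback via the new request's meta.
        for href in response.css('a.item::attr(href)').getall():
            yield Request(response.urljoin(href),
                          meta={'category': category},
                          callback=self.parse_item)

    def parse_item(self, response):
        # The category travels along the whole request chain.
        yield {
            'url': response.url,
            'category': response.meta['category'],
        }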
