Python Scrapy: passing properties into parser

Posted 2019-09-02 02:48

I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes.

I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then store the category on the resulting Items when the parser processes the response for each URL.

I guess I could have a separate spider for each category and just hold the category as an attribute on the class, so the parser can pick it up from there. But I was hoping to have just one spider for all the URLs and tell the parser which category to use for a given URL.

Right now, I'm setting up the URLs in start_urls via my spider's __init__() method. How do I pass the category for a given URL from my __init__() method to the parser, so that I can record the category on the Items generated from the responses for that URL?
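For reference, here's roughly what my current setup looks like (class and URL names are just placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URLs dynamically; each one should eventually
        # carry a category that the parser can attach to its Items.
        self.start_urls = [
            'http://example.com/a',  # Category 1
            'http://example.com/b',  # Category 1
            'http://example.com/d',  # Category 2
        ]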

Tags: python scrapy
1 Answer
爷的心禁止访问
#2 · 2019-09-02 03:25

As paul t. suggested:

from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        # Attach the category to each request through its meta dict.
        yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
        yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
        ...

    def parse(self, response):
        # The category set in start_requests is available here.
        category = response.meta['category']
        ...

Using start_requests() gives you control over the first URLs the spider visits and lets you attach metadata to each request; you can then access that metadata through response.meta in the callback.
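To record the category on your Items, include it in whatever parse yields. A minimal sketch inside the spider above, assuming a plain dict item and an illustrative title field:

    def parse(self, response):
        category = response.meta['category']
        # Store the category alongside the scraped data.
        yield {
            'title': response.css('title::text').get(),
            'category': category,
        }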

The same approach works whenever you need to pass data from one callback to another, for instance from parse to a parse_item method.
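A minimal sketch of that chaining, assuming a listing page whose item links match an illustrative a.item selector:

    def parse(self, response):
        category = response.meta['category']
        # Forward the category to the next callback via the new request's meta.
        for href in response.css('a.item::attr(href)').getall():
            yield Request(response.urljoin(href),
                          meta={'category': category},
                          callback=self.parse_item)

    def parse_item(self, response):
        # The category travels along the whole request chain.
        yield {
            'url': response.url,
            'category': response.meta['category'],
        }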
