How to add attributes to a request in a scrapy con

Posted 2019-06-02 16:57

A Scrapy contract fails when the callback instantiates an Item or ItemLoader from a value in response.meta that a previous parse method set on the Request() object.

I was thinking of overriding ScrapesContract to preprocess the request and load some dummy values into request.meta, but I'm not sure whether that is good practice.

I have seen the pre_process method in the docs (illustrated in the HasHeaderContract at the bottom) to get attributes from the request object, but I'm not sure if it can be used to set attributes.

EDIT: More details. Methods from an example crawler:

def parse_level_one(self, response):
    # populate loader
    return Request(url=url, callback=self.parse_level_two, meta={'loader': loader.load_item()})

def parse_level_two(self, response):
    """Parse product detail page

    @url http://example.com
    @scrapes some_field1 some_field2
    """
    loader = MyItemLoader(response.meta['loader'], response=response)

In the CLI:

$ scrapy check crawlername
Traceback... loader = MyItemLoader(response.meta['loader'], response=response)
KeyError: 'loader'

The idea that I am thinking about is this:

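from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail
from scrapy.item import BaseItem

class LoadedScrapesContract(Contract):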
    """ Contract to check presence of fields in scraped items
        @loadedscrapes page_name page_body
    """

    name = 'loadedscrapes'

    def pre_process(self, response):
        # MEDDLE WITH THE RESPONSE OBJECT HERE
        # TO ADD A META ATTRIBUTE TO RESPONSE,
        # LIKE AN EMPTY Item() or dict, JUST TO MAKE
        # THE ITEM LOADER INSTANTIATION PASS
        pass  # placeholder so the method body is valid Python

    # this is the same as ScrapesContract
    def post_process(self, output):
        for x in output:
            if isinstance(x, BaseItem):
                for arg in self.args:
                    if arg not in x:
                        raise ContractFail("'%s' field is missing" % arg)

Tags: python scrapy
1 answer
时光不老,我们不散
#2 · 2019-06-02 17:05

The best solution I've found for this is to do the following, rather than mucking up the contract:

loader = MyItemLoader(response.meta.get('loader', MyItem()), response=response)

I prefer this method, but to stick to the question, you can override adjust_request_args.
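A rough sketch of what overriding adjust_request_args could look like. It assumes the hook receives the dict of keyword arguments used to build the check request and must return it, as the Scrapy contracts docs describe. The class name WithMetaContract, the contract name with_meta, and the use of an empty dict as the placeholder item are all made up for illustration.

```python
try:
    from scrapy.contracts import Contract
except ImportError:
    # Fallback only so this sketch can be exercised without Scrapy installed.
    Contract = object


class WithMetaContract(Contract):
    """Hypothetical contract: pre-populate request.meta for `scrapy check`.

    @with_meta loader
    """

    name = 'with_meta'

    def adjust_request_args(self, args):
        # `args` holds the keyword arguments for the check request;
        # `meta` may be missing or None at this point.
        meta = dict(args.get('meta') or {})
        for key in self.args:
            # Assumption: an empty dict is enough of a placeholder item for
            # the loader. Values that are already set are left untouched.
            meta.setdefault(key, {})
        args['meta'] = meta
        return args
```

To try it, the contract would need to be registered in the SPIDER_CONTRACTS setting (for example under a hypothetical path such as 'myproject.contracts.WithMetaContract') and a line like "@with_meta loader" added to the docstring of parse_level_two next to @url and @scrapes. If MyItemLoader needs a real MyItem rather than a dict, swap the placeholder accordingly.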
