How to retry the request n times when an item gets

Posted 2020-05-19 21:14

Question:

I'm trying to scrape a range of web pages, but I get holes: sometimes the website seems to fail to send the HTML response correctly, which leaves empty lines in the CSV output file. How can I retry the request and the parse n times when the XPath selector on the response is empty? Note that I don't get any HTTP errors.

Answer 1:

You could do this with a custom retry middleware; you just need to override the process_response method of the built-in RetryMiddleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        # Custom check: retry a 200 response when spider.retry_xpath matches something
        if response.status == 200 and response.xpath(spider.retry_xpath):
            return self._retry(request, 'response got xpath "{}"'.format(spider.retry_xpath), spider) or response
        return response

Then enable it instead of the default RetryMiddleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}

The XPath that triggers a retry can now be configured per spider through the retry_xpath attribute:

class MySpider(Spider):
    name = "myspidername"

    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...

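As written, the middleware retries when retry_xpath matches something in the response, so it will not retry just because an Item field came back empty. To get that behavior, negate the check (if response.status == 200 and not response.xpath(spider.retry_xpath)) and set retry_xpath to the same path you use for that field.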
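
For the case in the question (retry when the selector comes back empty), the decision can be pulled into a small helper. This is only a sketch: needs_retry, FakeResponse and FakeSpider are hypothetical names introduced here, not part of Scrapy, and the stand-in classes exist only so the logic can be tried without running a crawl. It assumes the response exposes status and xpath() as in the middleware above.

```python
def needs_retry(response, spider):
    """Return True when a 200 response yields nothing for spider.retry_xpath."""
    # getattr keeps spiders that never define retry_xpath out of the check.
    retry_xpath = getattr(spider, "retry_xpath", None)
    if not retry_xpath or response.status != 200:
        return False
    # An empty selector result is falsy, so "not" means "nothing matched".
    return not response.xpath(retry_xpath)


# Hypothetical stand-ins, used only to exercise the helper.
class FakeResponse:
    def __init__(self, status, matches):
        self.status = status
        self._matches = matches

    def xpath(self, query):
        return self._matches


class FakeSpider:
    retry_xpath = '//h2[@class="tadasdop-cat"]'


print(needs_retry(FakeResponse(200, []), FakeSpider()))        # True
print(needs_retry(FakeResponse(200, ["<h2>"]), FakeSpider()))  # False
```

Inside process_response, the custom check would then read: if needs_retry(response, spider): return self._retry(request, reason, spider) or response, with reason being whatever message you want to see in the log.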



Answer 2:

You can set RETRY_TIMES in settings.py to the number of times you want pages to be retried. It defaults to 2.
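
As a rough illustration, the setting might look like this in settings.py; the value 5 is arbitrary. Keep in mind that the stock RetryMiddleware only retries on the status codes listed in RETRY_HTTP_CODES and on certain download errors, so raising this number alone probably will not catch a 200 response with missing content.

```python
# settings.py
RETRY_ENABLED = True  # already the default; shown for completeness
RETRY_TIMES = 5       # retries in addition to the first download (default: 2)
```

A single request can override the limit through the max_retry_times key in its meta dictionary.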

See more on RetryMiddleware



Tags: scrapy