Question:
I'm trying to scrape a range of webpages, but I'm getting holes in the data: sometimes it looks like the website fails to send the HTML response correctly, which leaves empty lines in the CSV output file. How would one retry the request (and the parse) n times when the xpath selector on the response comes back empty? Note that I don't get any HTTP errors.
Answer 1:
You could do this with a custom retry middleware; you just need to override the process_response method of the stock RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        # keep the stock behaviour: retry on the configured HTTP error codes
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # this is your check: retry when the spider's xpath matches nothing
        if response.status == 200 and not response.xpath(spider.retry_xpath):
            reason = 'empty xpath "{}"'.format(spider.retry_xpath)
            return self._retry(request, reason, spider) or response
        return response
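Note that _retry returns None once the retry budget is exhausted (the RETRY_TIMES setting, or the max_retry_times request meta key), so the "or response" fallback passes the last response through to your spider instead of dropping it.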
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
Now you have a middleware in which you can configure the xpath to retry on, inside your spider, with the retry_xpath attribute:
class MySpider(Spider):
    name = "myspidername"
    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...
This won't necessarily retry when your Item's field is empty, but you can point this retry_xpath attribute at the same path as that field to make it work; see the sketch below.
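For instance, a minimal sketch, assuming the field is scraped with a hypothetical //div[@id="price"]/text() selector (the URL and xpath are placeholders, not from the original answer):

from scrapy import Spider

class MySpider(Spider):
    name = "myspidername"
    start_urls = ["https://example.com/products"]  # hypothetical
    # hypothetical xpath for the field, reused as the retry check
    price_xpath = '//div[@id="price"]/text()'
    retry_xpath = price_xpath

    def parse(self, response):
        # if this xpath was empty, the middleware already retried the page
        yield {"price": response.xpath(self.price_xpath).get()}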
Answer 2:
You can set the RETRY_TIMES setting in settings.py to the number of times you want pages to be retried. It defaults to 2.
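For example, in settings.py (5 here is an arbitrary value):

RETRY_TIMES = 5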
See more on RetryMiddleware in the Scrapy docs.