I am new to scrapy. I am writing a spider designed to check a long list of urls for the server status codes and, where appropriate, what URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and url at each jump. I am using response.meta['redirect_urls'] to capture the urls, but am unsure how to capture the status codes - there doesn't seem to be a response meta key for it.
I realise I may need to write some custom middleware to expose these values, but am not quite clear how to log the status codes for every hop, nor how to access those values from the spider. I've had a look but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.
For example,
items = []
item = RedirectItem()
item['url'] = response.url
item['redirected_urls'] = response.meta['redirect_urls']
item['status_codes'] = #????
items.append(item)
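For illustration, here is the shape of data being asked for, with hypothetical URLs and status codes standing in for the values a spider would read from response.meta:

```python
# Hypothetical data: what a two-hop redirect chain would look like once
# both the URLs and the status code of each hop are captured.
redirect_urls = ["http://example.com/a", "http://example.com/b"]   # hops
redirect_status = [301, 302, 200]  # one status per hop, plus the final response

item = {
    "url": "http://example.com/c",       # final URL after all redirects
    "redirected_urls": redirect_urls,
    "status_codes": redirect_status,
}
print(item["status_codes"])  # [301, 302, 200]
```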
Edit - Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (freenode #scrapy) I've managed to do this. I believe it's a little hacky, so any comments for improvement are welcome:
(1) Disable the default middleware in the settings, and add your own:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the stock RedirectMiddleware class and captures each redirect:
from urlparse import urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every hop (the final response's status
        # is appended here too, since every response passes through this method)
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response
        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response
        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')
        return response
(3) You can now access the list of redirect status codes in your spider callback with
response.meta['redirect_status']
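The key mechanism is that meta attached to a request survives into the response the spider eventually receives, and setdefault/append builds the list one hop at a time. A plain-Python simulation (no Scrapy required) of how the list accumulates:

```python
# Simulate the middleware being called once per hop: each call appends that
# hop's status code to the same meta dict, just like the setdefault/append
# line in CustomRedirectMiddleware does.
def record_status(meta, status):
    meta.setdefault('redirect_status', []).append(status)

meta = {}
for status in (301, 302, 200):  # two redirects, then the final response
    record_status(meta, status)

print(meta['redirect_status'])  # [301, 302, 200]
```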
KISS solution: I thought it was better to add the strict minimum of code for capturing the new redirect field, and let RedirectMiddleware do the rest.
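The code for this answer was not preserved here. A minimal sketch of the idea (my reconstruction, not the answer's actual code) overrides process_response only to record the status, then delegates everything else to the parent class. The stub base class below stands in for Scrapy's RedirectMiddleware so the sketch runs standalone:

```python
class RedirectMiddlewareStub(object):
    """Stand-in for scrapy's RedirectMiddleware; the real one would build
    and return the redirected request here."""
    def process_response(self, request, response, spider):
        return response

class StatusTrackingRedirectMiddleware(RedirectMiddlewareStub):
    """Record each redirect hop's status, then let the base class redirect."""
    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307):
            request.meta.setdefault('redirect_status', []).append(response.status)
        return super(StatusTrackingRedirectMiddleware, self).process_response(
            request, response, spider)
```

In a real project the base class would be scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware, and the subclass would be enabled in DOWNLOADER_MIDDLEWARES in place of the default, as in step (1) above.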
Then, subclassing BaseSpider, you may access the captured statuses in your callback via response.meta['redirect_status'].
response.meta['redirect_urls'] is populated by RedirectMiddleware. Your spider callback will never receive the responses in between, only the last one after all redirects. If you want to control the process, subclass RedirectMiddleware, disable the original one, and enable yours. Then you can control the redirection process, including tracking the response statuses. See the original implementation in scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware.
As you can see, the _redirect method, which is called from several places, creates meta['redirect_urls']. And in the process_response method, return self._redirect(redirected, request, spider, response.status) is called, meaning that the original response is not passed to the spider.

I believe that's available as response.status.
See http://doc.scrapy.org/en/0.14/topics/request-response.html#scrapy.http.Response