Capturing HTTP status codes with a Scrapy spider

Published 2019-03-19 10:19

I am new to Scrapy. I am writing a spider designed to check a long list of URLs for their server status codes and, where appropriate, the URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and URL at each hop. I am using response.meta['redirect_urls'] to capture the URLs, but am unsure how to capture the status codes - there doesn't seem to be a corresponding response meta key.

I realise I may need to write some custom middleware to expose these values, but am not quite clear how to log the status code for every hop, nor how to access these values from the spider. I've had a look but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.

For example,

    items = []
    item = RedirectItem()
    item['url'] = response.url
    item['redirected_urls'] = response.meta['redirect_urls']     
    item['status_codes'] = #????
    items.append(item)

Edit - Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (freenode #scrapy) I've managed to do this. I believe it's a little hacky, so any comments for improvement are welcome:

(1) Disable the default middleware in the settings, and add your own:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}

(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the stock RedirectMiddleware class and captures the status code of each redirect:

from urlparse import urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every response, including each redirect hop
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response

        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['Location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response

        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['Location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['Location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')

        return response

(3) You can now access the list of redirect status codes in your spider callback with

response.meta['redirect_status']
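As a sanity check on the mechanism, here is a minimal plain-Python sketch (no Scrapy needed; URLs and statuses are made up for illustration) of how the setdefault(...).append(...) line accumulates one status per hop - note that because the append happens before the redirect check, the final non-redirect status also lands at the end of the list:

```python
# Toy model of the meta dict travelling through a redirect chain.
# "meta" stands in for request.meta; the middleware appends the status of
# every response it sees, redirect or not.

def follow_chain(hops):
    """hops: list of (url, status) pairs, ending at the final non-redirect page."""
    meta = {}
    for url, status in hops[:-1]:  # every redirect hop
        # what the custom middleware does on each response:
        meta.setdefault('redirect_status', []).append(status)
        # what Scrapy's stock _redirect does for the URL chain:
        meta['redirect_urls'] = meta.get('redirect_urls', []) + [url]
    final_url, final_status = hops[-1]
    meta.setdefault('redirect_status', []).append(final_status)
    return final_url, meta

url, meta = follow_chain([
    ('http://example.com/a', 301),
    ('http://example.com/b', 302),
    ('http://example.com/c', 200),
])
# meta['redirect_status'] -> [301, 302, 200]
# meta['redirect_urls']   -> ['http://example.com/a', 'http://example.com/b']
```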

3 Answers

Answer 1 · 2019-03-19 10:57
KISS solution: I found it better to add the bare minimum of code to capture the new redirect_status field and let RedirectMiddleware do the rest:

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every hop, then let the stock middleware handle it
        request.meta.setdefault('redirect_status', []).append(response.status)
        return super(CustomRedirectMiddleware, self).process_response(request, response, spider)
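As with the approach in the question's edit, this subclass still has to be registered in place of the built-in one; a settings.py fragment along these lines (the myproject.middlewares path is a placeholder matching the question's layout):

```python
# settings.py -- disable the stock RedirectMiddleware and enable the subclass
# at the same position; 'myproject.middlewares.CustomRedirectMiddleware' is
# a placeholder module path.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
```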

Then, in a BaseSpider subclass, you can access redirect_status in your callback with the following:

    def parse(self, response):
        item = ScrapyGoogleindexItem()
        item['redirections'] = response.meta.get('redirect_times', 0)
        item['redirect_status'] = response.meta['redirect_status']
        return item
Answer 2 · 2019-03-19 10:59
response.meta['redirect_urls'] is populated by RedirectMiddleware. Your spider callback will never receive the intermediate responses, only the final one after all redirects.

If you want to control the process, subclass RedirectMiddleware, disable the original one, and enable yours. Then you can control the redirection process, including tracking the response statuses.

Here is the relevant part of the original implementation (scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware):

class RedirectMiddleware(object):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def _redirect(self, redirected, request, spider, reason):
        ...
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
                [request.url]

As you can see, the _redirect method, which is called from several places in process_response, is what builds up meta['redirect_urls'].

And in the process_response method, return self._redirect(redirected, request, spider, response.status) is called, meaning that the intermediate response itself is never passed to the spider.
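To make that concrete, here is a toy, Scrapy-free model of the engine/middleware loop (the classes, URLs, and the dict-based "server" are illustrative stand-ins, not Scrapy's real API): when process_response returns a Request, the engine reschedules it instead of calling the spider callback, so only the final Response - carrying the accumulated meta - ever reaches the spider.

```python
# Toy model of the downloader middleware loop: a returned Request goes back
# to the scheduler; only a returned Response reaches the spider callback.

class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, url, status, location=None):
        self.url = url
        self.status = status
        self.location = location  # stand-in for the Location header

def process_response(request, response):
    """Simplified stand-in for RedirectMiddleware.process_response."""
    if response.status in (301, 302, 303, 307) and response.location:
        redirected = Request(response.location, dict(request.meta))
        # the accumulation pattern from the original _redirect method:
        redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + [request.url]
        return redirected   # back to the scheduler, not the spider
    return response         # only this ever reaches the callback

def crawl(url, server):
    """server maps url -> (status, location); loops until a Response survives."""
    request = Request(url)
    while True:
        status, location = server[request.url]
        result = process_response(request, Response(request.url, status, location))
        if isinstance(result, Response):
            return result, request.meta   # what the spider callback sees
        request = result

server = {
    'http://example.com/a': (301, 'http://example.com/b'),
    'http://example.com/b': (200, None),
}
response, meta = crawl('http://example.com/a', server)
# response.url -> 'http://example.com/b'
# meta['redirect_urls'] -> ['http://example.com/a']
```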
