While crawling, some pages fail because of an unexpected redirect, and no response is returned. How can I catch this kind of error and re-schedule the request with the original URL, not the redirected URL?
Before asking here, I searched Google extensively. There appear to be two ways to handle this: one is to catch the exception in a downloader middleware, the other is to handle the download exception in the errback of the spider's request. I have questions about both methods.
For method 1, I don't know how to pass the original URL to the process_exception function. Below is the example code I have tried:

    from scrapy import log


    class ProxyMiddleware(object):

        def process_request(self, request, spider):
            request.meta['proxy'] = ""
            log.msg('>>>> Proxy %s' % (request.meta['proxy'] if request.meta['proxy'] else ""),
                    level=log.DEBUG)

        def process_exception(self, request, exception, spider):
            # Read the proxy back from the request that failed.
            proxy = request.meta.get('proxy')
            log.msg('Failed to request url %s with proxy %s with exception %s'
                    % (request.url, proxy if proxy else 'nil', str(exception)))
            # Retry again. At this point request.url may be the redirected URL.
            return request

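For reference, here is a minimal sketch of what method 1 might look like if the original URL is read from the request itself. It assumes Scrapy's RedirectMiddleware is enabled and has recorded the redirect chain in `request.meta['redirect_urls']`. The class name `RetryOriginalUrlMiddleware`, the `max_retries` cap, the `original_url_retries` meta key, the `FakeRequest` stand-in and the `/redirected` URL are all hypothetical, introduced only so the sketch can run without Scrapy installed.

```python
class RetryOriginalUrlMiddleware:
    """Downloader-middleware sketch: retry a failed download from the first URL
    in the redirect chain instead of the redirected one."""

    max_retries = 3  # hypothetical cap, not a Scrapy setting

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("original_url_retries", 0)
        if retries >= self.max_retries:
            return None  # give up and let the normal error handling run
        # RedirectMiddleware appends each URL it leaves to meta['redirect_urls'],
        # so the first entry (if any) is the URL that was originally requested.
        chain = request.meta.get("redirect_urls") or [request.url]
        meta = dict(request.meta)
        meta["original_url_retries"] = retries + 1
        # Drop redirect bookkeeping so the retry starts a fresh chain.
        for key in ("redirect_urls", "redirect_times", "redirect_ttl"):
            meta.pop(key, None)
        return request.replace(url=chain[0], meta=meta, dont_filter=True)


class FakeRequest:
    """Stand-in for scrapy.Request (assumption: only url, meta, replace are needed)."""

    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = meta or {}
        self.dont_filter = dont_filter

    def replace(self, **changes):
        fields = {"url": self.url, "meta": self.meta, "dont_filter": self.dont_filter}
        fields.update(changes)
        return FakeRequest(**fields)


failed = FakeRequest("http://www.baidu.com/redirected",
                     meta={"redirect_urls": ["http://www.baidu.com/"]})
retry = RetryOriginalUrlMiddleware().process_exception(failed, IOError("no response"), None)
print(retry.url)  # http://www.baidu.com/
```

The retry counter is there because returning a request from process_exception unconditionally can loop forever when the page never recovers.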
For method 2, I don't know how to pass an external parameter to the errback function in the spider, or how to retrieve the original URL inside the errback so that I can re-schedule the request.
Below is the example I tried with method 2:

    from scrapy import Spider
    from scrapy.http import Request


    class ProxytestSpider(Spider):
        name = "proxytest"
        allowed_domains = ["baidu.com"]
        start_urls = (
            'http://www.baidu.com/',
        )

        def make_requests_from_url(self, url):
            starturl = url
            request = Request(url, dont_filter=True, callback=self.parse,
                              errback=self.download_errback)
            print("make requests")
            return request

        def parse(self, response):
            print("in parse function")

        def download_errback(self, e):
            # e is a twisted Failure wrapping the download error.
            print("%s %s" % (type(e), repr(e)))
            print(repr(e.value))
            print("in downloaderror_callback")

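One possible way to get an external parameter into the errback, sketched under assumptions: stash the value in the request's meta when the request is built, e.g. `Request(url, meta={'original_url': url}, callback=self.parse, errback=self.download_errback, dont_filter=True)`, and read it back through `failure.request`, which recent Scrapy versions attach to the Failure passed to the errback. The `RetryErrbackMixin` name, the `original_url` and `errback_retries` meta keys, the `max_retries` cap, the `fake_request` helper and the `/redirected` URL are hypothetical; the stand-ins exist only so the sketch runs without Scrapy.

```python
from types import SimpleNamespace


class RetryErrbackMixin:
    """Spider mixin sketch: the errback re-schedules the URL stashed in request.meta."""

    max_retries = 3  # hypothetical cap, not a Scrapy setting

    def download_errback(self, failure):
        request = failure.request  # the Request whose download failed
        # Fall back to the failing URL if nothing was stashed.
        original = request.meta.get("original_url", request.url)
        retries = request.meta.get("errback_retries", 0)
        if retries < self.max_retries:
            meta = dict(request.meta, errback_retries=retries + 1)
            # Requests yielded from an errback are scheduled like callback output.
            yield request.replace(url=original, meta=meta, dont_filter=True)


def fake_request(url, meta):
    """Stand-in for scrapy.Request (assumption: only url, meta, replace are needed)."""
    req = SimpleNamespace(url=url, meta=meta)
    req.replace = lambda **kw: fake_request(kw.get("url", url), kw.get("meta", meta))
    return req


failure = SimpleNamespace(
    request=fake_request("http://www.baidu.com/redirected",
                         {"original_url": "http://www.baidu.com/"}))
retries = list(RetryErrbackMixin().download_errback(failure))
print(retries[0].url)  # http://www.baidu.com/
```

Because a redirect copies the request's meta to the redirected request, the stashed `original_url` should still be present when the errback fires for the redirected URL.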
Any suggestions for this re-crawl issue are highly appreciated. Thanks in advance.