I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests; it works fine for the non-redirected ones.
I have the following code in the start_requests function:
for user in users:
    yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies,
                         headers=headers, dont_filter=True, callback=self.parse_p)
But self.parse_p is called only for the non-302 requests.
I guess you do get a callback, but for the final page (after the redirect). Redirects are taken care of by the RedirectMiddleware. You could disable it, but then you would have to do all the redirects manually. If you want to selectively disable redirects for a few types of Requests, you can do it like this:
request = scrapy.Request(url, meta={'dont_redirect': True}, callback=self.manual_handle_of_redirects)
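If you would rather disable redirects for the whole spider, here is a minimal sketch using Scrapy's REDIRECT_ENABLED setting (the spider name is just an example; combine it with handle_httpstatus_list so the 3xx responses are not dropped as unhandled errors):
import scrapy

class NoRedirectSpider(scrapy.Spider):
    name = "noredirect"  # example name
    # Disable RedirectMiddleware for this spider only
    custom_settings = {'REDIRECT_ENABLED': False}
    # Let 3xx responses reach your callbacks instead of being filtered out
    handle_httpstatus_list = [301, 302]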
I'm not sure that the intermediate Requests/Responses are very interesting, though. That's also what RedirectMiddleware believes. As a result, it does the redirects automatically and saves the intermediate URLs (the only interesting thing) in:
response.request.meta.get('redirect_urls')
You have a few options!
Example spider:
import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    start_urls = (
        'http://example.com/',
    )

    def parse(self, response):
        yield scrapy.Request(url="http://example.com/redirect302.php",
                             dont_filter=True, callback=self.parse_p)

    def parse_p(self, response):
        print(response.request.meta.get('redirect_urls'))
        print("done!")
Example output...
DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Redirecting (302) to <GET http://myredirect.com> from <GET http://example.com/redirect302.php>
DEBUG: Crawled (200) <GET http://myredirect.com/> (referer: http://example.com/redirect302.php)
['http://example.com/redirect302.php']
done!
If you really want to scrape the 302 pages, you have to explicitly allow it. For example, here I allow 302 and set dont_redirect to True:
handle_httpstatus_list = [302]

def parse(self, response):
    r = scrapy.Request(url="http://example.com/redirect302.php",
                       dont_filter=True, callback=self.parse_p)
    r.meta['dont_redirect'] = True
    yield r
The end result is:
DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Crawled (302) <GET http://example.com/redirect302.php> (referer: http://www.example.com/)
None
done!
This spider should manually follow 302 URLs:
import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    handle_httpstatus_list = [302]

    def start_requests(self):
        yield scrapy.Request("http://page_with_or_without_redirect.html",
                             callback=self.parse200_or_302,
                             meta={'dont_redirect': True})

    def parse200_or_302(self, response):
        print("I'm on: %s with status %d" % (response.url, response.status))
        if 'location' in response.headers:
            print("redirecting")
            # header values are bytes, so decode before building the Request
            return [scrapy.Request(response.headers['Location'].decode(),
                                   callback=self.parse200_or_302,
                                   meta={'dont_redirect': True})]
Be careful: don't omit setting handle_httpstatus_list = [302], otherwise you will get "HTTP status code is not handled or not allowed".
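If you only need this for one or two requests, Scrapy also accepts a handle_httpstatus_list key in the request meta, which allows those status codes for that single request instead of the whole spider. A minimal sketch (the URL is the same placeholder as above):
# Allow 302 responses for this one request only, instead of spider-wide
yield scrapy.Request("http://page_with_or_without_redirect.html",
                     callback=self.parse200_or_302,
                     meta={'dont_redirect': True,
                           'handle_httpstatus_list': [302]})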
By default, Scrapy follows 302 redirects only up to a limited number of hops (the REDIRECT_MAX_TIMES setting, which defaults to 20).
In your spider you can make use of the custom_settings attribute:
custom_settings
A dictionary of settings that will be overridden from the project wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.
Set the maximum number of times a request can be redirected as follows:
class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    custom_settings = {'REDIRECT_MAX_TIMES': 333}

    def start_requests(self):
        # Your code here
I set 333 as an example limit.
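Once the redirects have been followed, the RedirectMiddleware records what happened in the request meta, so you can inspect the hops in your callback. A small sketch (redirect_urls and redirect_times are standard RedirectMiddleware meta keys):
def parse(self, response):
    # URLs this request passed through before the final response
    print(response.request.meta.get('redirect_urls'))
    # how many redirects were actually followed
    print(response.request.meta.get('redirect_times', 0))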
I hope this helps.