I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests; it works fine for the non-redirected ones.
I have the following code in the start_requests function:
for user in users:
    yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies, headers=headers, dont_filter=True, callback=self.parse_p)
But self.parse_p is called only for the non-302 requests.
By default, Scrapy is not following 302 redirects.
In your spider you can make use of the custom_settings attribute:
Set the maximum number of times a URL request can be redirected as follows:
I set 333 as an example limit.
I hope this helps.
I guess you get a callback for the final page (after the redirect). Redirects are taken care of by the RedirectMiddleware. You could disable it, but then you would have to handle all the redirects manually. If you want to selectively disable redirects for only a few types of requests, you can do it per request by setting dont_redirect to True in the request's meta.

I'm not sure that the intermediate requests/responses are very interesting, though. That's also what RedirectMiddleware believes. As a result, it follows the redirects automatically and saves the intermediate URLs (the only interesting thing) in the redirect_urls key of the request's meta.

You have a few options!
Example spider:
Example output...
If you really want to scrape the 302 pages, you have to explicitly allow it. For example here, I allow 302 and set dont_redirect to True:
The end result is:
This spider should manually follow 302 urls:
Be careful. Don't omit setting handle_httpstatus_list = [302], otherwise you will get "HTTP status code is not handled or not allowed".