I'm trying to write a very simple website crawler that lists URLs along with referrer and status code, for 200, 301, 302 and 404 HTTP responses.
It turns out that Scrapy works great: my script uses it to crawl the website and lists URLs with 200 and 404 status codes without problems.
The problem is: I can't find how to make Scrapy follow redirects AND parse/output them. I can get one to work, but not both.
What I've tried so far:
setting meta={'dont_redirect': True}
setting REDIRECTS_ENABLED = False
adding 301, 302 to handle_httpstatus_list
changing settings specified in the redirect middleware doc
reading the redirect middleware code for insight
various combinations of all of the above
other random stuff
Here's the public repo if you want to take a look at the code.
If you want to parse 301 and 302 responses, and follow them at the same time, ask for 301 and 302 to be processed by your callback and mimic the behavior of RedirectMiddleware.
Test 1 (not working)
Let's illustrate with a simple spider to start with (not working as you intend yet):
Right now, the spider requests two pages, and the second one should redirect to http://www.example.com
The 302 is handled by RedirectMiddleware automatically and does not get passed to your callback.

Test 2 (still not quite right)
Let's configure the spider to handle 301 and 302 responses in the callback, using handle_httpstatus_list, and run it:
Here, we're missing the redirection.
Test 3 (working)
Do the same as RedirectMiddleware but in the spider callback:
And run the spider again:
We got redirected to http://www.example.com and we also got the response through our callback.