I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.website.com/page/1")
I would like to use a for loop in scrapy to increment the current guess at the page and stop when it reaches a 404. I know the response that is returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404: #This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
You can do something like this:
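For example, something like the following sketch (the URL pattern is taken from your question; the upper bound of 100 on the range is an arbitrary placeholder):

import urllib2

for page in range(1, 100):  # 100 is an arbitrary upper bound
    url = "http://website.com/page/%d" % page
    try:
        urllib2.urlopen(url)
    except urllib2.HTTPError as err:
        if err.code == 404:
            break  # page does not exist, stop looking
        raise
    # page exists; process it here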
This iterates through the range and attempts to connect to each URL via urllib2. I don't know scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that utilize this type of URL format are normally running a CMS that can automatically redirect non-existent pages to a custom "404 - Not Found" page, which will still show up with an HTTP status code of 200. In this case, the best way to detect a page that loads but is actually just the custom 404 page is to do some screen scraping and look for anything that would not appear during a "normal" page return, such as text that says "Page not found" or something similar and unique to the custom 404 page.
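For instance (assuming the site's error page contains the phrase "Page not found"; substitute whatever text is unique to that site's error page):

html = urllib2.urlopen(url).read()
if "Page not found" in html:  # marker text is site-specific
    break  # hit the CMS's custom 404 page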
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
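A rough sketch of how that might look in Scrapy, requesting one page at a time and scheduling the next request from the callback once the current page turns out to exist (the spider name and URL are placeholders based on your question; handle_httpstatus_list stops Scrapy from silently dropping the 404 response before it reaches the callback):

import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    # Let 404 responses through to the callback instead of
    # having the HttpError middleware filter them out.
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield scrapy.Request("http://website.com/page/0",
                             callback=self.parse, meta={"page": 0})

    def parse(self, response):
        if response.status == 404:
            return  # reached a missing page, stop requesting
        # ... extract items from the page here ...
        next_page = response.meta["page"] + 1
        yield scrapy.Request("http://website.com/page/%d" % next_page,
                             callback=self.parse,
                             meta={"page": next_page})

When parse() returns without yielding a new request, the chain ends, which replaces the while loop in your example.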