I'm going through a set of pages, and I'm not certain how many there are, but the current page is represented by a simple number in the URL (e.g. "http://www.website.com/page/1").
I would like to use a loop in Scrapy to increment the current guess at the page number and stop when it reaches a 404. I know the response returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404: #This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
You can do something like this:
from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        req = urllib2.Request(fullURL)
        resp = urllib2.urlopen(req)
        # Do your normal stuff here if the page is found.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))
    except urllib2.HTTPError as e:
        # urlopen raises HTTPError for non-2xx statuses, so a 404 lands here
        # rather than in resp.getcode().
        if e.code == 404:
            # Do whatever you want if a 404 is found.
            print("404 Found!")
        else:
            print("HTTP error {0} for URL: {1}".format(e.code, fullURL))
    except urllib2.URLError:
        print("Could not connect to URL: {0}".format(fullURL))
This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that use this type of URL format are running a CMS that can redirect non-existent pages to a custom "404 - Not Found" page, which will still come back with an HTTP status code of 200. In that case, the best way to detect a page that loads but is really just the custom 404 page is to do some screen scraping: look for text that would not appear on a "normal" page, such as "Page not found" or something similar and unique to the custom 404 page.
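For example, here is a minimal sketch of that kind of check using urllib2 (the "Page not found" marker string and the looks_like_soft_404 helper name are just assumptions; use whatever text is unique to your site's custom 404 page):

import urllib2

def looks_like_soft_404(url, marker="Page not found"):
    # Fetch the page body and look for text that only shows up on the
    # site's custom "not found" page. The marker is a placeholder;
    # replace it with something unique to the target site's 404 page.
    body = urllib2.urlopen(url).read()
    return marker in body

# Example: stop paging once the custom 404 page shows up.
# if looks_like_soft_404("http://www.website.com/page/42"):
#     print("Reached the end of the pages")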
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
from scrapy.http import Request
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))
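One caveat worth checking: depending on your Scrapy version, the HttpError middleware drops non-2xx responses before they reach your callback, so the 404 check above may never run. A minimal sketch, assuming you want the spider itself to allow 404s through via the handle_httpstatus_list attribute:

class MySpider(BaseSpider):
    name = 'website.com'
    baseUrl = "http://website.com/page/"
    # Let 404 responses reach parse() instead of being filtered out,
    # so the response.status check can actually see them.
    handle_httpstatus_list = [404]

    # ... start_requests() and parse() as above ...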