I'm going through a set of pages and I'm not certain how many there are, but the current page is represented by a simple number present in the url (e.g. "http://www.website.com/page/1")
I would like to use a for loop in scrapy to increment the current guess at the page and stop when it reaches a 404. I know the response that is returned from the request contains this information, but I'm not sure how to automatically get a response from a request.
Any ideas on how to do this?
Currently my code is something along the lines of:
def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while(stillExists):
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404: #This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False
You can do something like this:
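For example, something like the following sketch (the URL pattern is taken from your question; the upper bound of 100 on the range is an arbitrary placeholder):

import urllib2

for page in range(1, 100):  # 100 is an arbitrary upper bound
    url = "http://website.com/page/%d" % page
    try:
        urllib2.urlopen(url)
    except urllib2.HTTPError as err:
        if err.code == 404:
            break  # page does not exist, stop looking
        raise
    # page exists; process it here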
This iterates through the range and attempts to connect to each URL via urllib2. I don't know scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.
Note that most sites that utilize this type of URL format are normally running a CMS that can automatically redirect non-existent pages to a custom "404 - Not Found" page, which will still show up with an HTTP status code of 200. In this case, the best way to detect a page that loads but is actually just the custom 404 page is to do some screen scraping and look for anything that would not appear during a "normal" page return, such as text that says "Page not found" or something similar and unique to the custom 404 page.
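For instance (assuming the site's error page contains the phrase "Page not found"; substitute whatever text is unique to that site's error page):

html = urllib2.urlopen(url).read()
if "Page not found" in html:  # marker text is site-specific
    break  # hit the CMS's custom 404 page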
You need to yield/return the request in order to check the status; creating a Request object does not actually send it.
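A rough sketch of how that might look in Scrapy, requesting one page at a time and scheduling the next request from the callback once the current page turns out to exist (the spider name and URL are placeholders based on your question; handle_httpstatus_list stops Scrapy from silently dropping the 404 response before it reaches the callback):

import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    # Let 404 responses through to the callback instead of
    # having the HttpError middleware filter them out.
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield scrapy.Request("http://website.com/page/0",
                             callback=self.parse, meta={"page": 0})

    def parse(self, response):
        if response.status == 404:
            return  # reached a missing page, stop requesting
        # ... extract items from the page here ...
        next_page = response.meta["page"] + 1
        yield scrapy.Request("http://website.com/page/%d" % next_page,
                             callback=self.parse,
                             meta={"page": next_page})

When parse() returns without yielding a new request, the chain ends, which replaces the while loop in your example.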