Scrapy 404 error: HTTP status code is not handled

Posted 2020-02-13 02:29

I'm trying to scrape the site coursetalk using Scrapy. I started with the basic spider template and I'm getting a 404 error:

2017-12-29 23:34:30 [scrapy] DEBUG: Ignoring response <404 https://www.coursetalk.com/subjects/data-science/courses/>: HTTP status code is not handled or not allowed

This is the code I'm using:

import scrapy

class ListaDeCursosSpider(scrapy.Spider):
    name = "lista_de_cursos"
    start_urls = ['https://www.coursetalk.com/subjects/data-science/courses/'] 


    def parse(self, response):
        print(response.body)

And the complete log from Scrapy:

2017-12-29 23:34:26 [scrapy] INFO: Scrapy 1.0.3 started (bot: coursetalk)
2017-12-29 23:34:26 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-12-29 23:34:26 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'coursetalk.spiders', 'SPIDER_MODULES': ['coursetalk.spiders'], 'BOT_NAME': 'coursetalk'}
2017-12-29 23:34:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-12-29 23:34:27 [boto] DEBUG: Retrieving credentials from metadata server.
2017-12-29 23:34:28 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-12-29 23:34:28 [boto] ERROR: Unable to read instance data, giving up
2017-12-29 23:34:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-12-29 23:34:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-12-29 23:34:28 [scrapy] INFO: Enabled item pipelines: 
2017-12-29 23:34:28 [scrapy] INFO: Spider opened
2017-12-29 23:34:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-29 23:34:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-29 23:34:30 [scrapy] DEBUG: Crawled (404) <GET https://www.coursetalk.com/subjects/data-science/courses/> (referer: None)
2017-12-29 23:34:30 [scrapy] DEBUG: Ignoring response <404 https://www.coursetalk.com/subjects/data-science/courses/>: HTTP status code is not handled or not allowed
2017-12-29 23:34:30 [scrapy] INFO: Closing spider (finished)

Tags: python scrapy
2 answers
手持菜刀,她持情操
#2 · 2020-02-13 03:02

This website appears to return a 404 status code even though the response body can still be fetched normally.

Scrapy's HttpErrorMiddleware is enabled by default. It filters out unsuccessful HTTP responses so that spiders don't have to deal with them. For cases like this one, Scrapy provides the HTTPERROR_ALLOWED_CODES setting, which lets a spider handle responses even when they carry an error status code.

Adding HTTPERROR_ALLOWED_CODES = [404] to the project's settings.py works around the issue. With that setting in place, the following spider receives the response:

import scrapy
import logging

class ListaDeCursosSpider(scrapy.Spider):
    name = "lista_de_cursos"
    allowed_domains = ['www.coursetalk.com']
    start_urls = ['https://www.coursetalk.com/subjects/data-science/courses/'] 

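    def parse(self, response):
        # Log the status to confirm that the 404 response reaches the spider.
        logging.info("response.status: %s", response.status)
        # Extract the logo URL as a quick check that the body is usable.
        logourl = response.selector.css('div.main-nav__logo img').xpath('@src').extract()
        logging.info("response.logourl: %s", logourl)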
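
To make the filtering rule described in this answer concrete, here is a minimal sketch in plain Python of the decision HttpErrorMiddleware makes. It is a simplified model, not Scrapy's source code: the function name should_reach_spider and its parameters are hypothetical, and allowed_codes and allow_all only stand in for the HTTPERROR_ALLOWED_CODES and HTTPERROR_ALLOW_ALL settings.

```python
# Simplified model of the HttpErrorMiddleware rule (assumption: this is an
# illustration only; should_reach_spider is a hypothetical helper, not a
# Scrapy API).

def should_reach_spider(status, allowed_codes=(), allow_all=False):
    """Return True if a response with this status would be passed to parse()."""
    if 200 <= status < 300:
        return True  # successful responses always pass
    if allow_all:
        return True  # stands in for HTTPERROR_ALLOW_ALL = True
    return status in allowed_codes  # stands in for HTTPERROR_ALLOWED_CODES


if __name__ == "__main__":
    # Default behaviour: the 404 is dropped before parse() is called.
    print(should_reach_spider(404))
    # With 404 explicitly allowed, the response is delivered to the spider.
    print(should_reach_spider(404, allowed_codes=[404]))
```

In a real project the same effect comes from HTTPERROR_ALLOWED_CODES = [404] in settings.py, or from a handle_httpstatus_list = [404] attribute on the spider class if the change should apply to one spider only.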
\"骚年 ilove
#3 · 2020-02-13 03:23

I faced this problem with Scrapy and solved it.

I changed USER_AGENT in settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
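
One way to check whether the User-Agent really is what triggers the 404 is to request the page outside Scrapy with two different header values and compare the status codes. The sketch below uses only the Python 3 standard library. The helper names build_request and fetch_status are hypothetical, and whether the site still behaves this way is an assumption that has to be verified by running it.

```python
import urllib.error
import urllib.request

URL = "https://www.coursetalk.com/subjects/data-science/courses/"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
)


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code, or None if the request could not be made."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except OSError:
        return None  # DNS failure, timeout, refused connection, etc.
```

Calling fetch_status(URL, BROWSER_UA) and then fetch_status(URL, "Scrapy/1.0.3 (+http://scrapy.org)") (the second string is an assumed approximation of Scrapy's default User-Agent) shows whether the server answers the two clients differently. If both return the same code, the User-Agent is probably not the cause.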
