Scrapy shell returns without a response

Posted 2019-05-05 09:09

Question:

I have a small problem using Scrapy to crawl a website. I followed the Scrapy tutorial to learn how to crawl a site, and I wanted to test it on 'https://www.leboncoin.fr', but the spider doesn't work. So I tried:

scrapy shell 'https://www.leboncoin.fr'

But I don't get a response from the site.

$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS':    'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0,   'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1039fbd30>
[s]   item       {}
[s]   request    <GET https://www.leboncoin.fr>
[s]   settings   <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

If I run:

view(response)

an AttributeError is raised:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)

/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
     67     from scrapy.http import HtmlResponse, TextResponse
     68     # XXX: this implementation is a bit dirty and could be improved
---> 69     body = response.body
     70     if isinstance(response, HtmlResponse):
     71         if b'<base' not in body:

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1:

To rrschmidt: I have updated the question with the complete log. When I run

fetch('https://www.leboncoin.fr')

I get this:

2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>

So, how can I fix it?

Thanks for your answers,

Chris

Answer 1:

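It looks like the website has restricted scraping via its robots.txt: the log line "Forbidden by robots.txt" means Scrapy dropped the request before sending it, which is also why response is None and view(response) fails. It's usually polite to respect that wish.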
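To see how a robots.txt rule turns into a "forbidden" decision, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content below is hypothetical and is not the file the site actually serves, and is_allowed is just an illustrative helper name; Scrapy's own middleware may match rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# this is NOT the real file served by the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # With "Disallow: /" for every user agent, any URL on the site is refused.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Scrapy", "https://www.leboncoin.fr/"))
```

With ROBOTSTXT_OBEY enabled, Scrapy performs a comparable check against the site's real robots.txt before each request and filters out anything that is disallowed.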

But if you really want to scrape the site, you can change this behaviour. Your project currently enables it (note 'ROBOTSTXT_OBEY': True in the "Overridden settings" log line), so set ROBOTSTXT_OBEY to False in your settings.py:

ROBOTSTXT_OBEY = False
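If you would rather not change the setting for the whole project, it can also be overridden for a single spider through the custom_settings class attribute. This is only a sketch: the class name, the spider name, and the parse logic are made up for illustration and do not come from the question.

```python
import scrapy


class LeboncoinExampleSpider(scrapy.Spider):
    # Hypothetical spider name, chosen for this example only.
    name = "leboncoin_example"
    start_urls = ["https://www.leboncoin.fr"]

    # Per-spider override: takes precedence over the project's settings.py,
    # so robots.txt is ignored for this spider only.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Illustrative extraction: yield the page title, if there is one.
        yield {"title": response.css("title::text").extract_first()}
```

For a one-off shell session, the same setting can be passed on the command line with the -s option: scrapy shell -s ROBOTSTXT_OBEY=False 'https://www.leboncoin.fr'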