I am using Python 2.7 (64-bit, from Python.org) on 64-bit Windows Vista. I have been testing the following Scrapy code to recursively scrape all the pages at www.whoscored.com, a football statistics site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]

    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]
    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        # normalize-space(//title) yields a single selector, so this loop runs once
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            # extract every <p> element, join the markup, and strip the tags
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy', 'crawl', 'goal3'])
The code executes without any errors; however, of the 4623 pages scraped, 217 returned an HTTP response code of 200, 2 returned a 302, and 4404 returned a 403. Can anyone see anything immediately obvious in the code as to why this might be? Could this be an anti-scraping measure by the site? Is it usual practice to throttle the rate of requests to stop this happening?
Thanks
I don't know if this is still relevant, but I had to add a few lines to the settings.py file to get past the 403s.
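Something along these lines should work. This is a minimal sketch, assuming the 403s are triggered by Scrapy's default User-Agent and an aggressive crawl rate; the exact values are illustrative:

    # settings.py
    # Identify as a regular browser; many sites return 403 for the default Scrapy UA.
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'

    # Slow the crawl down and keep per-domain concurrency low.
    DOWNLOAD_DELAY = 2
    RANDOMIZE_DOWNLOAD_DELAY = True
    CONCURRENT_REQUESTS_PER_DOMAIN = 2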
Hope it helps.
HTTP Status Code 403 definitely means Forbidden / Access Denied.
HTTP Status Code 302 is for redirection of requests. No need to worry about them.
Nothing seems to be wrong in your code.
Yes, it's definitely an anti-scraping measure implemented by the site.
Refer to these guidelines from the Scrapy docs: Avoid Getting Banned.
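One of those guidelines is throttling the crawl; a minimal sketch using Scrapy's AutoThrottle extension (the setting names are from the Scrapy docs, the delay values are illustrative):

    # settings.py
    AUTOTHROTTLE_ENABLED = True      # adapt the crawl speed to server load
    AUTOTHROTTLE_START_DELAY = 5     # initial download delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60      # upper bound on the delay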
Also, you should consider pausing and resuming crawls.
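For example, running the spider with a job directory lets you stop it with Ctrl-C and resume later from the same point (the command form is from the Scrapy docs; the directory name is arbitrary):

    scrapy crawl goal3 -s JOBDIR=crawls/goal3-1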