Scrapy: how to set the referer URL

Published 2020-02-11 00:03

I need to set the referer URL before scraping a site. The site uses referer-based authentication, so it does not allow me to log in if the referer is not valid.

Could someone tell me how to do this in Scrapy?

4 Answers
淡お忘
#2 · 2020-02-11 00:37

Just set the Referer URL in the Request headers:

class scrapy.http.Request(url[, method='GET', body, headers, ...

headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).

Example:

return Request(url=your_url, headers={'Referer': 'http://your_referer_url'})
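
For instance, a minimal sketch of sending the header on a specific request inside a spider (the spider name, URLs, and the after_login callback are placeholders, not from the original question):

import scrapy
from scrapy import Request

class LoginSpider(scrapy.Spider):
    # Placeholder name and URLs for illustration.
    name = "login"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Attach the Referer header to the request that needs it.
        yield Request(
            url="http://example.com/login",
            headers={"Referer": "http://example.com/"},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.log("Login page fetched with status %s" % response.status)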

小情绪 Triste *
#3 · 2020-02-11 00:39

Override BaseSpider.start_requests and create your custom Requests there, passing each one your referer header.
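
A minimal sketch of that approach, assuming a later Scrapy where BaseSpider has been renamed scrapy.Spider and start_requests may be a generator (spider name and URLs are placeholders):

import scrapy
from scrapy import Request

class RefererSpider(scrapy.Spider):
    # Placeholder name and URLs for illustration.
    name = "referer_spider"
    start_urls = ["http://example.com/login"]

    def start_requests(self):
        # Yield a custom Request for each start URL, with the Referer header set.
        for url in self.start_urls:
            yield Request(url, headers={"Referer": "http://example.com/"})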

chillily
#4 · 2020-02-11 00:53

If you want to change the referer for all of your spider's requests, you can set DEFAULT_REQUEST_HEADERS in the settings.py file:

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.google.com' 
}
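
Note that this applies to every request in the project. If the header should apply to just one spider, a hedged alternative is the per-spider custom_settings attribute (available in Scrapy 1.0+; the spider name is a placeholder):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    # Overrides the project-wide default headers for this spider only.
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Referer': 'http://www.google.com',
        },
    }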
我命由我不由天
#5 · 2020-02-11 01:01

You should do exactly as @warwaruk indicated; below is my elaborated example for a crawl spider:

from scrapy.spiders import CrawlSpider
from scrapy import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        # Build one Request per start URL, each with the Referer header set.
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)

This should generate log lines like the following in your terminal:

(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)

The same works with BaseSpider, since start_requests is a BaseSpider method that CrawlSpider inherits.

The documentation explains more options that can be set on Request besides headers, such as cookies, the callback function, the priority of the request, and so on.
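
For example, a sketch combining several of those options (the URL, cookie value, and callback name are placeholders, not from the original answers):

import scrapy
from scrapy import Request

class AccountSpider(scrapy.Spider):
    # Placeholder spider; all values below are illustrative.
    name = "account"

    def start_requests(self):
        yield Request(
            url='http://example.com/account',
            headers={'Referer': 'http://example.com/'},
            cookies={'session': 'placeholder'},  # cookies sent with this request
            callback=self.parse_account,         # method that receives the response
            priority=10,                         # higher priority is scheduled earlier
        )

    def parse_account(self, response):
        self.log("Got %s" % response.url)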
