How to ignore already crawled URLs in Scrapy

Posted 2019-08-09 17:26

I have a crawler that looks something like this:

def parse(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse2)

def parse2(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse3)

def parse3(self, response):
    ...

I want to add a rule that ignores a URL if it has already been crawled when invoking parse2, while keeping that rule for parse3. I am still exploring the requests.seen file to see whether I can manipulate it.

2 Answers
虎瘦雄心在 · 2019-08-09 17:37

Check out the dont_filter request parameter, documented at http://doc.scrapy.org/en/latest/topics/request-response.html:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
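As a minimal sketch of how that could be wired into the spider from the question (the spider name, start URL, and link extraction below are placeholders; here the requests issued from parse2 bypass the duplicates filter while those issued from parse keep it, and you can move the flag to whichever hop should skip deduplication):

from scrapy import Request, Spider

class MySpider(Spider):
    name = 'myspider'                    # hypothetical spider name
    start_urls = ['http://example.com']  # placeholder start URL

    def parse(self, response):
        for next_url in response.css('a::attr(href)').getall():
            # These requests still go through the scheduler's duplicates filter.
            yield Request(url=response.urljoin(next_url), callback=self.parse2)

    def parse2(self, response):
        for next_url in response.css('a::attr(href)').getall():
            # dont_filter=True tells the scheduler not to drop this request
            # even if an identical request has already been seen.
            yield Request(url=response.urljoin(next_url), callback=self.parse3,
                          dont_filter=True)

    def parse3(self, response):
        pass  # final processing elided in the question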

forever°为你锁心 · 2019-08-09 17:51

You can configure this in settings.py via the DUPEFILTER_CLASS setting. Refer to the documentation for DUPEFILTER_CLASS:

Default: 'scrapy.dupefilter.RFPDupeFilter'

The class used to detect and filter duplicate requests.

The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
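One way to get per-callback behaviour out of this is to subclass RFPDupeFilter and point DUPEFILTER_CLASS at the subclass. The sketch below is only illustrative: the module path myproject/dupefilters.py, the class name SelectiveDupeFilter, and the skip_dedup meta key are made-up names, and it assumes a Scrapy version where the filter lives in scrapy.dupefilters (newer releases) rather than the scrapy.dupefilter path quoted above.

# myproject/dupefilters.py (hypothetical module)
from scrapy.dupefilters import RFPDupeFilter

class SelectiveDupeFilter(RFPDupeFilter):
    """Duplicate filter that can be bypassed per request via a meta flag."""

    def request_seen(self, request):
        # Requests flagged with meta['skip_dedup'] are never treated as duplicates;
        # everything else falls back to the standard fingerprint-based check.
        if request.meta.get('skip_dedup'):
            return False
        return super().request_seen(request)

Then enable it in settings.py and flag the requests that should skip deduplication:

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.SelectiveDupeFilter'

# in the spider
yield Request(url=nextUrl, callback=self.parse3, meta={'skip_dedup': True})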
