How to ignore already crawled URLs in Scrapy

Posted 2019-08-09 17:26

I have a crawler that looks something like this:

def parse(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse2)

def parse2(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse3)

def parse3(self, response):
    ...

I want to add a rule that ignores a URL if it has already been crawled when invoking parse2, while keeping that rule for parse3. I am still exploring the requests.seen file to see whether I can manipulate it.

2 Answers
虎瘦雄心在 · 2019-08-09 17:37

Check out the dont_filter request parameter, documented at http://doc.scrapy.org/en/latest/topics/request-response.html:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
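As a minimal sketch of how that could be wired into the spider from the question (the spider name, start URL, and link extraction below are placeholders; here the requests issued from parse2 bypass the duplicates filter while those issued from parse keep it, and you can move the flag to whichever hop should skip deduplication):

from scrapy import Request, Spider

class MySpider(Spider):
    name = 'myspider'                    # hypothetical spider name
    start_urls = ['http://example.com']  # placeholder start URL

    def parse(self, response):
        for next_url in response.css('a::attr(href)').getall():
            # These requests still go through the scheduler's duplicates filter.
            yield Request(url=response.urljoin(next_url), callback=self.parse2)

    def parse2(self, response):
        for next_url in response.css('a::attr(href)').getall():
            # dont_filter=True tells the scheduler not to drop this request
            # even if an identical request has already been seen.
            yield Request(url=response.urljoin(next_url), callback=self.parse3,
                          dont_filter=True)

    def parse3(self, response):
        pass  # final processing elided in the question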

forever°为你锁心 · 2019-08-09 17:51

You can configure this in settings.py via the DUPEFILTER_CLASS setting. Refer to the documentation for DUPEFILTER_CLASS:

Default: 'scrapy.dupefilter.RFPDupeFilter'

The class used to detect and filter duplicate requests.

The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
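One way to get per-callback behaviour out of this is to subclass RFPDupeFilter and point DUPEFILTER_CLASS at the subclass. The sketch below is only illustrative: the module path myproject/dupefilters.py, the class name SelectiveDupeFilter, and the skip_dedup meta key are made-up names, and it assumes a Scrapy version where the filter lives in scrapy.dupefilters (newer releases) rather than the scrapy.dupefilter path quoted above.

# myproject/dupefilters.py (hypothetical module)
from scrapy.dupefilters import RFPDupeFilter

class SelectiveDupeFilter(RFPDupeFilter):
    """Duplicate filter that can be bypassed per request via a meta flag."""

    def request_seen(self, request):
        # Requests flagged with meta['skip_dedup'] are never treated as duplicates;
        # everything else falls back to the standard fingerprint-based check.
        if request.meta.get('skip_dedup'):
            return False
        return super().request_seen(request)

Then enable it in settings.py and flag the requests that should skip deduplication:

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.SelectiveDupeFilter'

# in the spider
yield Request(url=nextUrl, callback=self.parse3, meta={'skip_dedup': True})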
