Do scrapy LinkExtractors end up with unique links?

Posted 2019-05-31 07:23

So, I have a site with a lot of articles spread over numbered pages. If I want to extract an article I use:

Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article')

For the pagination I use this rule: Rule(LinkExtractor(allow=r'page=\d+'))

So I end up with these rules:

rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),
    Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
]
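
For reference, here is a minimal sketch of how such rules could sit inside a CrawlSpider; the spider name, domain, start URL, and parsing logic below are placeholders:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ArticleSpider(CrawlSpider):
    # Placeholder name, domain, and start URL.
    name = 'articles'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/?page=1']

    rules = [
        # Follow pagination links; no callback, so these pages are only crawled for more links.
        Rule(LinkExtractor(allow=r'page=\d+')),
        # Follow article links and hand them to parse_article.
        Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
    ]

    def parse_article(self, response):
        # Placeholder parsing logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}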

My question is: will I get repeated pages? That is, will the link to page 3 be extracted from pages 1, 2, 4, 5, 6 (until page 3 is no longer visible) and added to the extracted link list each time, or does it only keep unique URLs in the end?

Tags: scrapy
1 Answer
Answered 2019-05-31 08:06

By default, LinkExtractor only returns unique links: it has an optional parameter, unique, which is True by default.

But that only ensures the links extracted from each page are unique. If the same link occurs on a later page, it will be extracted again.
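A quick way to see that per-response deduplication is to call extract_links directly on a response; the HTML below is made up purely for illustration:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up page that links to page 3 twice.
body = b'<a href="/list?page=3">3</a> <a href="/list?page=3">3</a>'
response = HtmlResponse(url='https://example.com/list?page=1',
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=r'page=\d+')  # unique=True is the default
links = extractor.extract_links(response)
print([link.url for link in links])
# Only one entry for https://example.com/list?page=3 -- the duplicate within
# this single response is dropped.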

By default, your spider will also avoid revisiting URLs it has already seen, governed by the DUPEFILTER_CLASS setting. The only caveat is that if you stop the spider and start it again, the record of visited URLs is reset. See "Jobs: pausing and resuming crawls" in the documentation for how to persist that information when you pause and resume a spider.
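
As a rough sketch of what that looks like in settings (the JOBDIR path is arbitrary, and DUPEFILTER_CLASS is shown only because it is already the default):

# settings.py (or custom_settings on the spider)

# Already the default; shown only to make the dedup filter explicit.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Persist the scheduler and dupefilter state to disk so a paused crawl can
# be resumed without revisiting URLs it has already seen.
# The directory name is arbitrary; use a fresh one per job.
JOBDIR = 'crawls/articles-1'

Equivalently, you can pass it on the command line, e.g. scrapy crawl articles -s JOBDIR=crawls/articles-1; stopping the spider cleanly (one Ctrl-C) and running the same command again resumes from the saved state.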
