So, I have a page with a lot of articles and page numbers. Now if I want to extract an article, I use:
Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article')
For pages I use this Rule: Rule(LinkExtractor(allow=r'page=\d+'))
so I end up with these rules:
rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),
    Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
]
My question is: will I get repeated pages? That is, will it extract page 3 from pages 1, 2, 4, 5, 6 (until page 3 is no longer visible) and add it to the extracted link list each time, or does it only keep unique URLs in the end?
By default, LinkExtractor should only return unique links. There is an optional parameter, unique, which is True by default. But that only ensures the links extracted from each page are unique; if the same link occurs on a later page, it will be extracted again.
By default, your spider should automatically ensure it doesn't visit the same URLs again, according to the DUPEFILTER_CLASS setting. The only caveat is that if you stop and restart your spider, the record of visited URLs is reset. See "Jobs: pausing and resuming crawls" in the documentation for how to persist that information when you pause and resume a spider.
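If you do need to pause and resume, a minimal sketch (the spider name and directory are placeholders) is to give the crawl a JOBDIR, either on the command line or in settings.py, so the scheduler queue and the duplicate filter's record of seen requests survive a restart:

# settings.py -- or on the command line:
#   scrapy crawl articles -s JOBDIR=crawls/articles-1
JOBDIR = 'crawls/articles-1'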