So, I have a page with a lot of articles and page numbers. Now if I want to extract an article, I use:
Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article')
For pages I use this Rule: Rule(LinkExtractor(allow=r'page=\d+'))
so I end up with these rules:
rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),
    Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
]
My question is: will I get repeated pages? That is, will it extract page 3 from pages 1, 2, 4, 5, 6 (until page 3 is no longer visible) and add it to the extracted link list each time, or does it only keep unique URLs in the end?
By default, LinkExtractor should only return unique links. There is an optional parameter, unique, which is True by default. But that only ensures the links extracted from each page are unique; if the same link occurs on a later page, it will be extracted again.
By default, your spider should automatically ensure it doesn't visit the same URLs again, according to the DUPEFILTER_CLASS setting. The only caveat is that if you stop and restart your spider, the record of visited URLs is reset. See "Jobs: pausing and resuming crawls" in the documentation for how to persist that information when you pause and resume a spider.
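If you do need to pause and resume, a minimal sketch (the spider name and directory are placeholders) is to give the crawl a JOBDIR, either on the command line or in settings.py, so the scheduler queue and the duplicate filter's record of seen requests survive a restart:

# settings.py -- or on the command line:
#   scrapy crawl articles -s JOBDIR=crawls/articles-1
JOBDIR = 'crawls/articles-1'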