Avoid Duplicate URL Crawling

Posted 2020-02-08 02:55

I wrote a simple crawler. In the settings.py file, following the Scrapy documentation, I set

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

If I stop the crawler and then restart it, it scrapes the duplicate URLs again. Am I doing something wrong?

Tags: scrapy
3 Answers
萌系小妹纸
#2 · 2020-02-08 03:08

You can replace the scheduler with a Redis-backed one such as scrapy-redis. Its dupefilter keeps the seen request fingerprints in Redis, so duplicate URL crawling is avoided even when you rerun your project.
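As a minimal sketch (assuming scrapy-redis is installed and a Redis server is reachable at the URL shown):

# settings.py -- rough sketch, assumes scrapy-redis and a local Redis instance
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                     # keep the queue and fingerprints in Redis between runs
REDIS_URL = "redis://localhost:6379"         # adjust to your Redis instance

Because the fingerprints live in Redis rather than in the crawler process, stopping and restarting the spider does not reset the duplicate filter.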

疯言疯语
#3 · 2020-02-08 03:22

According to the documentation, DUPEFILTER_CLASS is already set to scrapy.dupefilters.RFPDupeFilter by default, so that setting changes nothing.

RFPDupeFilter doesn't help if you stop the crawler: it only deduplicates requests within a single crawl, because its set of seen fingerprints is kept in memory and is lost when the process exits.

It looks like you need to create your own custom filter based on RFPDupeFilter, as was done here: how to filter duplicate requests based on url in scrapy. If you want the filter to work across scrapy crawl sessions, you have to keep the list of crawled URLs somewhere persistent, such as a database or a CSV file.
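As a rough sketch of that idea (not the code from the linked answer; the file name seen_urls.txt and the module path in the setting are just examples), a dupefilter can reload previously crawled URLs from disk on start-up and append new ones as it goes:

import os
from scrapy.dupefilters import RFPDupeFilter

class PersistentURLDupeFilter(RFPDupeFilter):
    """Skips URLs already crawled in earlier runs by keeping them in a text file."""

    def __init__(self, path=None, debug=False, *args, **kwargs):
        super().__init__(path, debug, *args, **kwargs)
        self.urls_path = "seen_urls.txt"   # example location, choose your own
        self.seen_urls = set()
        if os.path.exists(self.urls_path):
            with open(self.urls_path) as f:
                self.seen_urls = {line.strip() for line in f}
        self.urls_file = open(self.urls_path, "a")

    def request_seen(self, request):
        if request.url in self.seen_urls:
            return True                    # already crawled in a previous session
        self.seen_urls.add(request.url)
        self.urls_file.write(request.url + "\n")
        return super().request_seen(request)

    def close(self, reason):
        self.urls_file.close()
        super().close(reason)

Then point the setting at your class:

DUPEFILTER_CLASS = "myproject.dupefilters.PersistentURLDupeFilter"   # example module path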

Hope that helps.

一纸荒年 Trace。
#4 · 2020-02-08 03:30

I believe what you are looking for is "persistence support", which lets you pause and resume crawls.

To enable it, run your spider with a job directory:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

You can read more about it in the Scrapy documentation under "Jobs: pausing and resuming crawls".
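If you prefer, the job directory can also be set in settings.py instead of on the command line (the directory name below is just an example). Re-running the same crawl command then resumes from where the previous run stopped:

# settings.py -- equivalent to passing -s JOBDIR=... on the command line
JOBDIR = "crawls/somespider-1"   # must point to the same directory on every run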
