I am writing a crawler for a website using Scrapy with CrawlSpider.
Scrapy provides a built-in duplicate-request filter that filters duplicate requests based on their URLs. I can also filter requests using the rules member of CrawlSpider.
What I want to do is filter out requests like:
http://www.abc.com/p/xyz.html?id=1234&refer=5678
if I have already visited:
http://www.abc.com/p/xyz.html?id=1234&refer=4567
NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.
Now, if I keep a set that accumulates all the ids, I could skip the already-seen ones in my callback function parse_item to achieve this.
But that would mean I am still fetching the page when I don't need to.
So how can I tell Scrapy not to send a particular request based on its URL?
Here is my custom filter, based on Scrapy 0.24.6.
This filter only cares about the id in the URL. For example,
http://www.example.com/products/cat1/1000.html?p=1
http://www.example.com/products/cat2/1000.html?p=2
are treated as the same URL, but
http://www.example.com/products/cat2/all.html
is not.
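The matching rule described above can be sketched with the standard library alone. This is only an illustration, not the original filter: `dedupe_key` and the regular expression are hypothetical, and the pattern assumes the id is a purely numeric file name ending in `.html`.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape: the id is a numeric file name such as ".../1000.html".
_ID_PATTERN = re.compile(r"/(\d+)\.html$")


def dedupe_key(url):
    """Return the numeric id from the path, or the full URL if there is none."""
    match = _ID_PATTERN.search(urlsplit(url).path)
    return match.group(1) if match else url
```

With this key, the two `1000.html` URLs collapse to the same value, while `all.html` has no id and falls back to its full URL, so it is never confused with a product page.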
You can write a custom duplicate filter and register it in the settings.
Then you need to set the correct DUPEFILTER_CLASS in settings.py.
It should work after that.
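For the question's case, one way to build the key such a filter compares is to drop the parameters that don't affect the response before checking for duplicates. A minimal sketch using only the standard library follows; `canonical_url` and `IGNORED_PARAMS` are hypothetical names, and in a Scrapy project this logic would typically be called from the `request_seen` method of an `RFPDupeFilter` subclass.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "refer" is the only parameter that does not change the response.
IGNORED_PARAMS = frozenset({"refer"})


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters and sort the rest so equal pages compare equal."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    )
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```

Two URLs that differ only in `refer` then produce the same string, so a set of canonical URLs is enough to decide whether a request is a duplicate.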
https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py
This file might help you. It builds a database of unique delta-fetch keys, which the user passes in via scrapy.Request(meta={'deltafetch_key': unique_url_key}). This lets you avoid duplicate requests for pages you have already visited in the past.
A sample MongoDB implementation using deltafetch.py
e.g. for id = 345: scrapy.Request(url, meta={'deltafetch_key': 345}, callback=parse)
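To illustrate the general idea of remembering keys between runs, here is a small sketch that uses SQLite from the standard library as a stand-in. It is not how deltafetch.py or the MongoDB variant stores its keys; `SeenKeyStore` and `add_if_new` are hypothetical names.

```python
import sqlite3


class SeenKeyStore:
    """Persist already-seen keys in SQLite (a stand-in, not deltafetch's own storage)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep keys between runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")

    def add_if_new(self, key):
        """Record the key; return True if it was new, False if already stored."""
        cursor = self.conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (str(key),)
        )
        self.conn.commit()
        return cursor.rowcount == 1
```

A crawler could call `add_if_new(345)` before building the request for id 345 and skip the request when it returns False.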
Following ytomar's lead, I wrote this filter that filters based purely on URLs that have already been seen by checking an in-memory set. I'm a Python noob so let me know if I screwed something up, but it seems to work all right:
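A rough sketch of such a URL-only filter is below. It is written as a plain class so it runs without Scrapy installed; in a real project the same `request_seen` method would more likely live in a subclass of `RFPDupeFilter` (found in `scrapy.dupefilter` in 0.24, `scrapy.dupefilters` in later releases). `SeenURLFilter` and `FakeRequest` are names chosen for illustration.

```python
from collections import namedtuple

# Stand-in for scrapy.Request so the sketch runs without Scrapy installed.
FakeRequest = namedtuple("FakeRequest", "url")


class SeenURLFilter:
    """Flag a request as seen when its exact URL is already in an in-memory set."""

    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True  # duplicate: the caller should drop the request
        self.urls_seen.add(request.url)
        return False
```

Because the set lives in memory, it is emptied every time the crawl restarts, and URLs that differ only in an irrelevant parameter still count as different.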
As ytomar mentioned, be sure to add the DUPEFILTER_CLASS constant to settings.py.
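The setting takes the dotted import path of the filter class. The path below is hypothetical; adjust it to wherever the class actually lives in your project.

```python
# settings.py
# Hypothetical dotted path: replace with the module that contains your filter class.
DUPEFILTER_CLASS = "myproject.dupefilters.SeenURLFilter"
```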