I'd like to override Scrapy's default RFPDupeFilter class as follows:

    from scrapy.dupefilters import RFPDupeFilter

    class URLDupefilter(RFPDupeFilter):

        def request_fingerprint(self, request):
            if not request.url.endswith('.xml'):
                return request.url
The rationale is that I would like to make the requests.seen file 'human-readable' by using the scraped URLs (which are sufficiently unique) rather than a hash. However, I would like to omit URLs ending with .xml (which correspond to sitemap pages). This way, the request_fingerprint method will return None if the request's URL ends with .xml. Is this a valid implementation of a dupefilter?
If you look into the request_seen() method of the RFPDupeFilter class, you can see how Scrapy compares fingerprints: fp in self.fingerprints. In your case this would resolve to None in {None}, since your fingerprint is None and self.fingerprints is a set. This is valid Python and resolves properly. So yes, you can return None.
Edit: However, this will only let through the first .xml response, since the fingerprints set will not contain a None fingerprint yet. Ideally you want to fix the request_seen method in your dupefilter as well, so that it simply returns False if the fingerprint is None.
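Putting both pieces together, here is a sketch of the combined fix. For illustration, the Scrapy base and request classes are replaced by minimal stand-ins so the snippet is self-contained; in a real project you would subclass scrapy.dupefilters.RFPDupeFilter and let Scrapy pass in real Request objects:

```python
class RFPDupeFilter:
    """Minimal stand-in for scrapy.dupefilters.RFPDupeFilter (illustration only)."""
    def __init__(self):
        self.fingerprints = set()

class Request:
    """Minimal stand-in for scrapy.Request (illustration only)."""
    def __init__(self, url):
        self.url = url

class URLDupefilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Use the URL itself as the human-readable fingerprint;
        # sitemap (.xml) URLs get no fingerprint at all.
        if not request.url.endswith('.xml'):
            return request.url
        return None

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp is None:
            return False  # never filter .xml requests, not even the second one
        if fp in self.fingerprints:
            return True   # duplicate URL: filter it
        self.fingerprints.add(fp)
        return False

df = URLDupefilter()
print(df.request_seen(Request('https://example.com/page')))         # False: first visit
print(df.request_seen(Request('https://example.com/page')))         # True: duplicate
print(df.request_seen(Request('https://example.com/sitemap.xml')))  # False
print(df.request_seen(Request('https://example.com/sitemap.xml')))  # False: still not filtered
```

The real RFPDupeFilter also writes each fingerprint to the requests.seen file in request_seen(); with the None guard in place, .xml URLs are simply never written there.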