Is it OK for Scrapy's request_fingerprint method to return None?

Posted 2019-09-15 05:55

I'd like to override Scrapy's default RFPDupefilter class as follows:

from scrapy.dupefilters import RFPDupeFilter

class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        if not request.url.endswith('.xml'):
            return request.url

The rationale is that I would like to make the requests.seen 'human-readable' by using the scraped URLs (which are sufficiently unique) rather than a hash. However, I would like to omit URLs ending with .xml (which correspond to sitemap pages).

As written, the request_fingerprint method will return None whenever the request's URL ends with .xml. Is this a valid implementation of a dupefilter?

Tags: python scrapy

1 Answer

放我归山 · 2019-09-15 06:42

If you look at the request_seen() method of the RFPDupeFilter class, you can see how Scrapy compares fingerprints:

def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

The check fp in self.fingerprints would, in your case, evaluate to None in {None, ...}, since your fingerprint is None and self.fingerprints is a set. That is valid Python and evaluates correctly.
So yes, you can return None.
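To see that None behaves like any other value in set membership tests, here is a quick standalone demonstration:

```python
# None is hashable, so it can be stored in a set and tested for membership
# just like any other fingerprint value.
seen = set()
print(None in seen)   # False: the set is still empty
seen.add(None)
print(None in seen)   # True: None is now a member
seen.add(None)
print(len(seen))      # 1: sets deduplicate, so repeated None adds collapse
```

This is exactly why all .xml requests after the first would share the single None "fingerprint".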

Edit: However, this will let the first .xml response through, since the fingerprints set will not contain None yet. Also, if file persistence is enabled, self.file.write(fp + os.linesep) would raise a TypeError when fp is None. Ideally you should also override the request_seen method in your dupefilter to simply return False if the fingerprint is None.
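Here is a minimal, self-contained sketch of the combined fix. To keep it runnable without Scrapy installed, a plain class stands in for the RFPDupeFilter subclass (in a real project you would subclass scrapy.dupefilters.RFPDupeFilter and override only these two methods), and SimpleNamespace stands in for scrapy.Request:

```python
import os
from types import SimpleNamespace

class URLDupefilter:
    """Sketch: in real use this would subclass scrapy.dupefilters.RFPDupeFilter."""

    def __init__(self):
        self.fingerprints = set()
        self.file = None  # RFPDupeFilter opens requests.seen here when persistence is on

    def request_fingerprint(self, request):
        # Human-readable fingerprint: the URL itself, except for sitemap pages.
        if not request.url.endswith('.xml'):
            return request.url
        return None

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp is None:
            # Never treat sitemap requests as seen, and never write None to disk.
            return False
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False

df = URLDupefilter()
page = SimpleNamespace(url='https://example.com/item/1')
sitemap = SimpleNamespace(url='https://example.com/sitemap.xml')
print(df.request_seen(page))     # False: first visit
print(df.request_seen(page))     # True: duplicate, will be filtered
print(df.request_seen(sitemap))  # False: .xml requests are never filtered
print(df.request_seen(sitemap))  # False: still not filtered on repeat
```

With this, .xml requests are never deduplicated (every sitemap fetch goes through), while all other URLs are stored as readable lines in requests.seen.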
