Scrapy - Spider crawls duplicate urls

Published 2019-02-25 10:20

Question:

I'm crawling a search results page and scraping the title and link information from it. Since it is a search page, it also contains links to the next pages, which I have allowed in the SgmlLinkExtractor.

The problem is: on page 1 the spider finds the links to page 2 and page 3 and crawls them perfectly. But when it crawls page 2, that page again contains links to page 1 (the previous page) and page 3 (the next page), so the spider crawls page 1 again with page 2 as the referrer, and it goes into a loop.

The Scrapy version I use is 0.17.

I have searched the web for answers and tried the following:

1)

Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", unique=True, follow=True),

But the unique keyword was not identified as a valid parameter there.
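
From the SgmlLinkExtractor signature it looks to me as if unique belongs to the link extractor itself (where it already defaults to True) rather than to Rule, i.e. something along these lines:

    Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*",), unique=True),
         callback="parse_items_1", follow=True),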

2) I have tried to specify the default filter in settings as DUPEFILTER_CLASS = RFPDupeFilter, but that raises:

    DUPEFILTER_CLASS = RFPDupeFilter
NameError: name 'RFPDupeFilter' is not defined
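
I suppose the setting expects a dotted string path rather than a bare class name, so settings.py would presumably look something like this (I have not verified the exact module path for 0.17; it differs between Scrapy versions):

    # settings.py -- the value is a string import path, not a bare class name
    # (older Scrapy releases use scrapy.dupefilter, newer ones scrapy.dupefilters)
    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'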

3) I have also tried a custom filter, a snippet I found on the web, although I don't understand much of it. The code is as follows. The visit id and status are captured, but it does not identify the pages that have already been crawled.

Note: the snippet is copied from the web and I don't have many details about it.

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from Amaze.items import AmazeItem

class IgnoreVisitedItems(object):
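    """Spider middleware that records the fingerprint of every request that yields
    an item and, for requests marked with meta['filter_visited'], emits a
    visit_status='old' item instead of visiting the page again."""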
    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
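        # Visited fingerprints are kept in spider.context; the {} default here
        # only persists between calls if the spider defines a context attribute.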
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
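                # Only requests that opt in via meta['filter_visited'] are checked.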
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(AmazeItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
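        # Prefer an explicit visited_id from meta; otherwise fall back to the
        # request fingerprint (a hash of the request method, URL and body).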
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
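
From what I understand of the snippet, it only takes effect if the middleware is enabled in SPIDER_MIDDLEWARES and the requests carry the filter_visited key in meta, roughly like this (the module path and priority number are just my guesses):

    # settings.py -- register the spider middleware
    SPIDER_MIDDLEWARES = {
        'Amaze.middlewares.IgnoreVisitedItems': 543,
    }

    # inside a spider callback -- requests must opt in via meta
    yield Request(url, callback=self.parse_items_1,
                  meta={'filter_visited': True})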

My intention is to have the spider itself ignore pages that have already been crawled, rather than keeping the crawled pages in a list and checking, every time a page is crawled, whether it is in that list.

Any thoughts on this, please?

Answer 1:

You have not shown the code of your spider, but possibly you pass the argument dont_filter=True when creating the Request. Try specifying Request(dont_filter=False) explicitly. This tells the spider that it must not repeat identical requests.
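
A minimal sketch of what I mean (next_page_url is a placeholder for whatever URL you extract from the results page):

    from scrapy.http import Request

    def parse_items_1(self, response):
        # ... scrape title/link items here ...

        # dont_filter defaults to False: the built-in dupe filter then drops any
        # request whose fingerprint has already been seen, e.g. page 1 requested
        # again from page 2, so the loop you describe cannot happen.
        yield Request(next_page_url, callback=self.parse_items_1)

        # dont_filter=True would bypass the filter and re-crawl page 1 each time:
        # yield Request(next_page_url, callback=self.parse_items_1, dont_filter=True)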