How to prevent duplicates on Scrapy fetching depending on an existing JSON file

Posted 2019-08-21 09:40

In this spider:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):
        # Follow the comments link of each topic listed on the front page.
        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()

        # Posts with no visible score fall back to "0".
        if response.css('div.score.likes::attr(title)').extract_first() is not None:
            topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
        else:
            topics["score"] = "0"

        # Only fetch the author's profile page for high-scoring topics.
        if int(topics["score"]) > 10000:
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')

        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()

        yield users
        yield topics

I get these results:

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  ....
]

But I run this spider every day to collect the week's posts, so if, for example, today is the 7th day of the week, I get duplicates of the 6 previous days' results along with today's, like this:

day1: result_day1
day2: result_day2, result_day1
day3: result_day3, result_day2, result_day1
. . . . . . .
day7: result_day7, result_day6, result_day5, result_day4, result_day3, result_day2, result_day1

All the data is stored in a JSON file as shown above. What I want is to tell the spider to check whether a fetched result already exists in the JSON file: if it does, skip it; if it does not, add it to the file.

Is that possible using Scrapy?

For example:

if yesterday's results (06.json) were

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
]

And today's results (07.json) are

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
]

then I want today's list (07.json) to end up as

[
  {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
]

after filtering
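
For reference, the filtering step itself is simple to do outside Scrapy once both files exist; a minimal sketch using the file names from the example above:

import json

# Load yesterday's and today's results (file names from the example above).
with open('06.json') as f:
    yesterday = json.load(f)
with open('07.json') as f:
    today = json.load(f)

# Dicts aren't hashable, so compare a stable serialization of each item.
seen = {json.dumps(item, sort_keys=True) for item in yesterday}
filtered = [item for item in today
            if json.dumps(item, sort_keys=True) not in seen]

with open('07.json', 'w') as f:
    json.dump(filtered, f, indent=2)

But I would rather have the spider itself handle this.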

1 Answer

干净又极端 · 2019-08-21 09:56

Scrapy really provides only one built-in way to look for duplicate data (as opposed to duplicate requests): collecting the data as items and dropping repeats with a duplicates filter in an item pipeline. See:

https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter

It drops items when a duplicate is detected. I have two problems with this approach: (1) you have to write the duplicate-filter method yourself, defining what counts as a duplicate based on the data you're working with, and (2) it only helps with checking duplicates within the same run of the spider.
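
As a minimal sketch for the dict items above, assuming (title, author) is what defines a duplicate (user items have no title, so this falls back to name):

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Drop items already seen during this run of the spider."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Assumption: (title, author) identifies a topic; user items have
        # no 'title', so fall back to the 'name' field for those.
        key = (item.get('title'), item.get('author') or item.get('name'))
        if key in self.seen:
            raise DropItem("Duplicate item found: %r" % (key,))
        self.seen.add(key)
        return item

You would enable it in your settings, e.g. ITEM_PIPELINES = {'yourproject.pipelines.DuplicatesPipeline': 300} (the module path here is an assumption).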

An alternative approach, since you run the spider across days, is to persist state between runs. See:

https://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

Using this approach, your spider.state would hold the data from the last run (the previous day). Then, when you run the spider again, you know what data the last run produced, so you can implement logic to keep only the data that is unique to the current day (timestamp each day's data and compare against the last day). You could implement this quickly, and it might be good enough to solve your issue.
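
As a minimal sketch, assuming the spider is started with a job directory (e.g. scrapy crawl Reddit -s JOBDIR=crawls/reddit-1) so that self.state survives between runs, parse_topics could skip titles recorded on a previous run:

    def parse_topics(self, response):
        # self.state is persisted to the job directory between runs by
        # Scrapy's SpiderState extension; without JOBDIR it is not available.
        seen = self.state.setdefault('seen_titles', [])

        title = response.css('a.title::text').extract_first()
        if title in seen:
            return  # already collected in a previous run, skip it
        seen.append(title)
        # ... build and yield the topic dict as before ...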

But this approach gets unwieldy if you have to compare against all days prior to the current one: the spider would have to persist the data for every day of the week before the current one, so your spider.state dictionary (which would essentially be the JSON results for each day) would grow very large as it fills with the data from all days prior to day 7, for example.

If you need to ensure that the data added for the current day is unique compared to all days before it, I would ditch Scrapy's built-in mechanisms entirely and just write all the data to a database, with timestamps of when each record was scraped. You can then use database queries to find out what unique data was added on each individual day.
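
A minimal sketch of that idea with sqlite3 (the schema and the choice of (title, author) as the unique key are assumptions):

import sqlite3
from datetime import date

conn = sqlite3.connect('reddit.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS topics (
        title TEXT,
        author TEXT,
        score INTEGER,
        scraped_on TEXT,
        UNIQUE (title, author)
    )
""")

def store(topic):
    # INSERT OR IGNORE silently skips rows whose (title, author) already
    # exists, so only topics never seen on any previous day are added.
    conn.execute(
        "INSERT OR IGNORE INTO topics VALUES (?, ?, ?, ?)",
        (topic['title'], topic['author'], int(topic['score']),
         date.today().isoformat()),
    )
    conn.commit()

Finding the unique data added on a given day is then a single query, e.g. SELECT * FROM topics WHERE scraped_on = '2019-08-21'.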
