Scrapy - How to scrape daily for new pages

Posted 2019-03-22 03:24

I'm evaluating whether Scrapy is right for me. All I want is to scrape several sports news sites daily for the latest headlines and extract the title, date, and article body. I don't care about following links within the body of the article; I just want the body.

As I understand it, crawling is a one-off job that crawls an entire site based on the links it finds. I don't want to hammer a site, and I also don't want to crawl the entire site; just the sports section, and only the headlines.

So, in summary, I want Scrapy to:

  1. once a day, find news articles from a specified domain that are different from yesterday's
  2. extract each new article's date, time, and body
  3. save the results to a database

Is this possible, and if so, how would I achieve it? I've read the tutorial, but it seems the process it describes would crawl an entire site as a one-time job.

1 Answer

趁早两清 · 2019-03-22 03:57

Take a look at the DeltaFetch middleware, which is part of a library of Scrapy add-ons published by Scrapinghub. It stores on disk the URLs of pages that generate Items and will not visit them again. It still allows Scrapy to visit other pages (which is typically needed to find the item pages). It is a fairly simple middleware that can be customized for your specific needs.
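As a minimal sketch, assuming you have installed the scrapy-deltafetch package (pip install scrapy-deltafetch), enabling it comes down to a couple of entries in your project's settings.py; the middleware path and setting names below follow that package's documentation:

    # settings.py
    # Enable the DeltaFetch spider middleware so that requests whose
    # pages already yielded items are skipped on later runs.
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True
    # To force a full recrawl, run the spider once with
    # -s DELTAFETCH_RESET=1 to clear the stored state.

The seen-URL state persists between runs (by default under the project's .scrapy data directory), which is what makes the daily "only new articles" behavior work.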

You would need to run your crawl daily (say, using cron) with this middleware enabled.
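For example, a crontab entry along these lines would do it; the project path and the spider name sports_headlines are placeholders for your own:

    # Run the sports-headlines spider daily at 06:00
    0 6 * * * cd /path/to/project && scrapy crawl sports_headlines >> /var/log/scrapy-daily.log 2>&1

Running via cd into the project directory matters, since Scrapy needs to find your scrapy.cfg (and the DeltaFetch state lives in the project's data directory).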
