Scrapy - How to scrape daily for new pages

Posted 2019-03-22 03:24

I'm evaluating whether Scrapy is right for me. All I want is to scrape several sports news sites daily for the latest headlines and extract the title, date, and article body. I don't care about following links within the body of the article; I just want the body.

As I understand it, crawling is a one-off job that crawls an entire site based on the links it finds. I don't want to hammer a site, and I also don't want to crawl the entire site; just the sports section, and only the headlines.

So, in summary, I want Scrapy to:

  1. once a day, find news articles from a specified domain that are different from yesterday's
  2. extract each new article's date, time, and body
  3. save the results to a database

Is this possible, and if so, how would I achieve it? I've read the tutorial, but it seems the process it describes would crawl an entire site as a one-time job.

1 Answer

趁早两清 · 2019-03-22 03:57

Take a look at the DeltaFetch middleware, which is part of a library of Scrapy add-ons published by Scrapinghub. It stores on disk the URLs of pages that generate Items and will not visit them again. It still allows Scrapy to visit other pages (which is typically needed to find the item pages). It is a fairly simple middleware that can be customized for your specific needs.
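As a minimal sketch, assuming you have installed the scrapy-deltafetch package (pip install scrapy-deltafetch), enabling it comes down to a couple of entries in your project's settings.py; the middleware path and setting names below follow that package's documentation:

    # settings.py
    # Enable the DeltaFetch spider middleware so that requests whose
    # pages already yielded items are skipped on later runs.
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True
    # To force a full recrawl, run the spider once with
    # -s DELTAFETCH_RESET=1 to clear the stored state.

The seen-URL state persists between runs (by default under the project's .scrapy data directory), which is what makes the daily "only new articles" behavior work.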

You would need to run your crawl daily (say, using cron) with this middleware enabled.
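For example, a crontab entry along these lines would do it; the project path and the spider name sports_headlines are placeholders for your own:

    # Run the sports-headlines spider daily at 06:00
    0 6 * * * cd /path/to/project && scrapy crawl sports_headlines >> /var/log/scrapy-daily.log 2>&1

Running via cd into the project directory matters, since Scrapy needs to find your scrapy.cfg (and the DeltaFetch state lives in the project's data directory).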
