I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new pages that were added during the day). The content of these portals will be indexed for searching.

The problem is re-crawling these portals: the first crawl of a portal takes a very long time (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl), and I want to re-crawl as fast as possible, for example by checking the modification date. I used wget to download www.bankier.pl, but it complains that there is no Last-Modified header. Is there any way to re-crawl this many sites?

I have also tried Nutch, but the script for re-crawling does not seem to work properly, or perhaps it also depends on that header (Last-Modified). Maybe there is a tool or crawler (like Nutch or something similar) that can update already-downloaded sites and add new ones?
Best regards, Wojtek
I recommend using curl to fetch only the headers (an HTTP HEAD request) and checking whether the Last-Modified header has changed since the last crawl.
For example, a minimal sketch (www.bankier.pl is the portal from the question; as you noticed, some portals do not send Last-Modified at all, in which case this check will not help):
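    # Fetch only the response headers (HTTP HEAD) and look for Last-Modified
    curl -sI http://www.bankier.pl/ | grep -i '^Last-Modified'

    # Re-download only when the server copy is newer than the local one:
    # with -z <file>, curl sends an If-Modified-Since header based on the
    # file's modification time (assumes bankier.html exists from a prior crawl)
    curl -s -z bankier.html -o bankier.html http://www.bankier.pl/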
For Nutch, I have written a blog post on how to re-crawl. Basically, you should set a low value for the db.fetch.interval.default setting. On the next fetch of a URL, Nutch will use the last fetch time as the value of the If-Modified-Since HTTP header. A sketch of the property override is shown below.
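A minimal sketch of the override in conf/nutch-site.xml (the property name is Nutch's; the one-day value is my assumption for a nightly re-crawl, since the interval is given in seconds):

    <!-- conf/nutch-site.xml: shorten the re-fetch interval so pages
         become eligible for re-crawling every night -->
    <property>
      <name>db.fetch.interval.default</name>
      <!-- interval in seconds; 86400 = 1 day (the stock default is 30 days) -->
      <value>86400</value>
    </property>

With this in place, the usual generate/fetch/updatedb cycle will pick up pages whose last fetch is older than a day, and unchanged pages should not be downloaded again thanks to the If-Modified-Since request.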