Using Nutch how to crawl the dynamic content of we

Posted 2019-07-05 00:08

I am using Apache Nutch 1.10 to crawl web pages and extract their content. Some of the pages contain dynamic content that is loaded by Ajax calls. Nutch is unable to crawl and extract this dynamic content. How can I solve this? Is there a solution? If so, please help me with your answers.

Thanks in advance.

Tags: java ajax plugins web-crawler nutch

2 answers

走好不送

Reply #2 · 2019-07-05 00:30

Check out the latest Nutch 1.11 trunk, which includes a new plugin, protocol-interactiveselenium (https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium).

This plugin lets you write your own handler and execute JavaScript in the browser to retrieve dynamic content.
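
To make the handler idea concrete, here is a minimal sketch that runs without Nutch or Selenium on the classpath. Every name in it (`Browser`, `Handler`, `ScrollHandler`, `FakeBrowser`, the `/listing` URL pattern) is a hypothetical stand-in chosen for illustration. It is not the plugin's actual interface, so check the plugin source for the real handler contract before adapting it.

```java
import java.util.ArrayList;
import java.util.List;

public class AjaxHandlerSketch {

    // Hypothetical stand-in for a browser session. In the real plugin this
    // role is played by Selenium's WebDriver; this interface is NOT its API.
    public interface Browser {
        void executeScript(String script);
        String pageSource();
    }

    // Hypothetical handler contract: pick the URLs that need interaction,
    // then drive the browser and return the rendered HTML.
    public interface Handler {
        boolean shouldProcess(String url);
        String process(Browser browser);
    }

    // Scrolls to the bottom of the page so that lazily loaded (Ajax) content
    // is requested, then returns whatever the browser has rendered.
    public static class ScrollHandler implements Handler {
        @Override
        public boolean shouldProcess(String url) {
            return url.contains("/listing"); // assumed URL pattern
        }

        @Override
        public String process(Browser browser) {
            browser.executeScript("window.scrollTo(0, document.body.scrollHeight);");
            return browser.pageSource();
        }
    }

    // Fake browser for the demo: it records the scripts it receives and
    // pretends that any scroll script makes the Ajax content appear.
    public static class FakeBrowser implements Browser {
        public final List<String> scripts = new ArrayList<>();
        private String html = "<ul id=\"items\"></ul>";

        @Override
        public void executeScript(String script) {
            scripts.add(script);
            if (script.contains("scrollTo")) {
                html = "<ul id=\"items\"><li>loaded via ajax</li></ul>";
            }
        }

        @Override
        public String pageSource() {
            return html;
        }
    }

    public static void main(String[] args) {
        Handler handler = new ScrollHandler();
        String url = "http://example.com/listing?page=1";
        if (handler.shouldProcess(url)) {
            System.out.println(handler.process(new FakeBrowser()));
        }
    }
}
```

In a real crawl, the fake browser would be replaced by the browser session the plugin hands to your handler, and the returned HTML would go on to Nutch's parsers.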


疯言疯语

Reply #3 · 2019-07-05 00:45

Most web crawler libraries do not offer JavaScript rendering out of the box. You usually have to plug in another library or product that provides JavaScript rendering, such as Selenium or PhantomJS.

Here is a tutorial using Nutch and Selenium.

