Apache Nutch not adding internal links in a web pa

Posted 2019-07-27 04:22

I am using Apache Nutch 1.7 and have run into a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains. I am only interested in the internal links.

However, when this page is crawled, its internal links are not added to the fetch list for the next round (I have given a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next round's fetch list.

On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.

This problem does not occur with any other domain. Can someone tell me what is different about this particular page?

I have also attached the nutch-site.xml that I am using for your review. Please advise.

Tags: web-crawler nutch

1 answer

(This account has been banned)

#2 · 2019-07-27 05:20

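Your seed URL is being rejected by the default URL filters, so your page is not being crawled. The URL contains the characters ? and =, and the default rule that skips probable query URLs rejects any URL containing either of them.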

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

With

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
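To see why that one rule matters, here is a rough Python sketch of first-match URL filtering applied to the seed URL from the question. It is an illustration, not Nutch's actual code. It assumes rules are tried in order, that the first matching pattern decides (- rejects, + accepts), that a URL matching no rule is rejected, and that the rule list ends with an accept-everything "+." rule. The function name first_match_filter is made up for this example, and the other default rules (protocol and file-suffix skips) are left out.

```python
import re

def first_match_filter(rules, url):
    """Hypothetical helper: the first rule whose pattern is found in the URL decides.

    rules is a list of (sign, pattern) pairs; "+" accepts, "-" rejects.
    A URL that matches no rule is rejected (assumed default).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

# Rule from the answer as shipped, followed by an assumed catch-all accept rule.
default_rules = [("-", r".*[?*!@=].*"), ("+", r".")]

# Same list with ? and = removed from the character class, as the answer suggests.
relaxed_rules = [("-", r".*[*!@].*"), ("+", r".")]

print(first_match_filter(default_rules, SEED))  # False: the URL contains ? and =
print(first_match_filter(relaxed_rules, SEED))  # True: falls through to the accept rule
```

With the original character class the seed URL is rejected at the first rule. With the relaxed one it reaches the accept rule, and so would any internal link on the page that carries a query string.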
