Apache Nutch not adding internal links in a web pa

Posted 2019-07-27 04:22

I am using Apache Nutch 1.7 and have a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains; I am only interested in the internal links.

However, when this page is crawled, its internal links are not added to the fetch list for the next round (I have set a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next round's fetch list.

On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.

This problem does not occur with any other domain. Can someone tell me what is different about this particular page?

I have also attached the nutch-site.xml that I am using for your review. Please advise.

1 answer
This account has been banned
#2 · 2019-07-27 05:20

Your seed URL is being rejected by the default URL filters, so the page is never crawled.

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

With

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
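
To see why the original rule drops the seed URL, the check can be approximated outside Nutch. The sketch below is not Nutch code: it re-implements a simplified first-match-wins filter with Python's `re` module, and the `accepts` helper, the rule lists, and the trailing `+.` catch-all rule are assumptions made for illustration. It only shows that the `?` and `=` in the seed URL trigger the original exclusion rule, and that the relaxed rule lets the URL through.

```python
import re

# Seed URL taken from the question.
SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

# Hypothetical rule lists: (sign, pattern) pairs, checked in order.
# The final "+." accept-everything rule is an assumption for this sketch.
DEFAULT_RULES = [("-", r".*[?*!@=].*"), ("+", r".")]
RELAXED_RULES = [("-", r".*[*!@].*"), ("+", r".")]


def accepts(rules, url):
    """Return True if the first rule that matches the URL is a '+' rule.

    A URL that matches no rule at all is treated as rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    print("default rules accept seed:", accepts(DEFAULT_RULES, SEED))
    print("relaxed rules accept seed:", accepts(RELAXED_RULES, SEED))
```

With the original rule the seed URL is rejected because it contains `?` and `=`. After removing those two characters from the character class, the same URL falls through to the accept rule.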