Apache Nutch not adding internal links in a web pa

Posted 2019-07-27 05:10

Question:

I am using Apache Nutch 1.7 and have a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains; I am only interested in the internal links.

However, when this page is crawled, its internal links are not added to the fetch list for the next round (I have set a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next round's fetch list.

On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.

This problem does not occur with any other domain. Can someone tell me what is different about this particular page?

I have also attached the nutch-site.xml that I am using for your review. Please advise.

Answer 1:

Your seed URL contains the characters ? and =, so it is rejected by the default URL filters and your page is not being crawled.

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

With

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
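
To see why the change matters, here is a rough Python sketch that checks the seed URL against the old and new rule. It is not Nutch code: it assumes the filter applies rules top to bottom with the first match deciding, and it assumes a trailing `+.` catch-all rule like the one usually found at the end of the stock regex-urlfilter.txt. The names `accepts`, `DEFAULT_RULES` and `EDITED_RULES` are made up for illustration.

```python
import re

SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

# (sign, pattern) pairs; "-" rejects and "+" accepts.
# The trailing "+." catch-all is an assumption about the rest of the file.
DEFAULT_RULES = [("-", r".*[?*!@=].*"), ("+", r".")]
EDITED_RULES = [("-", r".*[*!@].*"), ("+", r".")]


def accepts(url, rules):
    """Return True if the first rule that matches the URL is a '+' rule.

    A URL that matches no rule at all is treated as rejected
    (an assumption made for this sketch).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


print(accepts(SEED, DEFAULT_RULES))  # False: '?' and '=' hit the skip rule
print(accepts(SEED, EDITED_RULES))   # True: no '*', '!' or '@' in the URL
```

With the edited rule, any internal link that carries a query string can pass this filter as well, which is probably what you want for this site.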