I am using Apache Nutch 1.7 and have a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains; I am only interested in the internal links.
However, when this page is crawled, its internal links are not added to the fetch list for the next round (I have set a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next round's fetch list.
On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.
This problem does not occur with any other domain. Can someone tell me what is different about this particular page?
I have also attached the nutch-site.xml that I am using for your review. Please advise.
Your seed URL is being rejected by the default URL filters, so your page is not being crawled.
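To see why a filter would drop this seed, here is a minimal Python sketch of first-match-wins URL filtering. It is not Nutch's own code. It assumes the stock conf/regex-urlfilter.txt still carries the rule `-[?*!@=]` (skip URLs that look like queries), which would match the `?_rdc=1` at the end of the seed. The rule lists and the `accepts` helper are illustrative names, and the relaxed rule `-[*!@]` is only a hypothetical example of loosening that filter.

```python
import re

# Assumed subset of the stock regex-urlfilter.txt rules (an assumption, not quoted from the post).
DEFAULT_RULES = [
    "-^(file|ftp|mailto):",  # skip non-HTTP schemes
    "-[?*!@=]",              # skip URLs containing probable query characters
    "+.",                    # accept anything else
]

# Hypothetical relaxed variant: '?' and '=' are no longer rejected.
RELAXED_RULES = [
    "-^(file|ftp|mailto):",
    "-[*!@]",
    "+.",
]

SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"


def accepts(url, rules):
    """Return True if the first rule whose regex is found in the URL starts with '+'.

    A URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


print(accepts(SEED, DEFAULT_RULES))  # rejected: the seed contains '?' and '='
print(accepts(SEED, RELAXED_RULES))  # accepted once '?' and '=' are allowed
```

Allowing `?` and `=` also lets every other query-string URL through, so a tighter accept rule for the pages you actually want may be worth adding.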
Edit the following files:
conf/automaton-urlfilter.txt
conf/regex-urlfilter.txt
Replace
With