I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt:
+^https://www.mywebsite.com/abc-def/(.+)*$
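For comparison, here is a minimal sketch of how I understand such a filter could also be written, with the dots escaped so they match literally and an explicit reject-everything-else rule (mywebsite.com is of course a placeholder, and I'm not certain whether either change matters for my problem):

```
# accept only URLs under the abc-def path (dots escaped to match literal dots)
+^https?://www\.mywebsite\.com/abc-def/
# reject everything else
-.
```

As I understand it, the rules are applied top to bottom and the first match wins; a URL that matches no rule is rejected anyway, so the trailing -. just makes that explicit.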
When I run the following crawl command:
**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**
It crawls and indexes just the one URL from seed.txt, and in the 2nd iteration it just says:
Generator: starting at 2017-02-28 09:51:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
When I change regex-urlfilter.txt to allow everything (+.), it starts indexing every URL on https://www.mywebsite.com, which I certainly don't want.
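For anyone trying to reproduce this: the way I understand the filter rules can be checked against individual URLs is with Nutch's URLFilterChecker tool, something like the sketch below (I may have the exact flags wrong, and the sample URL is just an illustration):

```
# reads URLs from stdin and prints +URL if accepted by the combined filters, -URL if rejected
echo "https://www.mywebsite.com/abc-def/some-page" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```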
If anyone has run into the same problem, please share how you got past it.