I am using Apache Nutch 2.3 for crawling. I started with about 200 URLs in the seed list. As time has elapsed, the number of documents crawled is decreasing, or at best staying the same as at the start.
How can I configure Nutch so that the number of documents crawled increases? Is there a parameter that controls the number of documents?
Second, how can I count the number of documents crawled per day by Nutch?
One crawl cycle consists of four steps: Generate, Fetch, Parse and Update DB. For detailed information, read my answer here.
A limited number of fetched URLs can be caused by the following factors:
Number of Crawl cycles:
If you execute only one crawl cycle, you will get few results, because only the URLs injected (seeded) into the crawldb are fetched initially. On successive crawl cycles, your crawldb is updated with new URLs extracted from previously fetched pages.
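For example, additional cycles can be run with the bundled crawl script. A rough sketch for Nutch 2.x, where the seed directory, crawl ID, and number of rounds are placeholders (the exact script arguments vary between versions, so check the usage message of your bin/crawl):

```shell
# Seed the crawldb and run 5 generate/fetch/parse/updatedb rounds.
# "urls/" is the seed directory, "myCrawl" an arbitrary crawl ID.
bin/crawl urls/ myCrawl 5
```

Each extra round gives Nutch a chance to fetch the outlinks discovered in the previous round, so the number of crawled documents grows over time.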
topN value:
As mentioned here and here, the topN value causes Nutch to fetch only a limited number of URLs in each cycle. If you have a small topN value, you will get fewer pages.
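For example, topN can be raised when generating a new batch. A sketch assuming the Nutch 2.x command-line syntax (the value 2000 is illustrative):

```shell
# Select up to 2000 top-scoring URLs for the next fetch cycle,
# instead of a smaller batch.
bin/nutch generate -topN 2000
```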
generate.max.count:
The generate.max.count property in your Nutch configuration file, i.e. nutch-default.xml or nutch-site.xml, limits the number of URLs to be fetched from a single domain, as stated here.
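A sketch of the relevant properties in nutch-site.xml (the values shown are illustrative; generate.count.mode controls whether the limit is applied per host, domain, or IP):

```xml
<property>
  <name>generate.max.count</name>
  <!-- Maximum URLs per host/domain in one generated batch; -1 means no limit -->
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Apply the limit per "host", "domain", or "ip" -->
  <value>host</value>
</property>
```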
As for your second question, how to count the number of pages crawled per day: you can read the log files and accumulate the number of pages crawled per day from there.
In Nutch 1.x the log file is generated in the logs folder: NUTCH_HOME/logs/hadoop.log
You can count the lines matching a given date and the status "fetching" in the logs like this:
grep -ic '2016-05-26.*fetching' logs/hadoop.log
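To get a per-day breakdown in one pass, you can group on the date field with awk. The log lines below are an illustrative stand-in for the real hadoop.log format, so adjust the pattern to match your log layout:

```shell
# Build a tiny sample log mimicking hadoop.log "fetching" lines
# (illustrative format, not the exact Nutch log layout).
printf '%s\n' \
  '2016-05-26 10:00:01,000 INFO  fetcher.Fetcher - fetching http://example.com/a' \
  '2016-05-26 10:00:02,000 INFO  fetcher.Fetcher - fetching http://example.com/b' \
  '2016-05-27 09:00:00,000 INFO  fetcher.Fetcher - fetching http://example.com/c' \
  > sample.log

# Pages fetched on a specific day:
grep -c '2016-05-26.*fetching' sample.log   # -> 2

# Per-day counts in one pass: group by the first field (the date).
awk '/fetching/ {count[$1]++} END {for (d in count) print d, count[d]}' sample.log
```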