How to increase number of documents fetched by Apa

Posted 2019-04-14 23:48

Question:

I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed list at the start. As time has elapsed, the number of documents crawled has decreased, or at most stayed the same as at the start.

How can I configure Nutch so that the number of documents crawled increases? Is there a parameter that controls the number of documents? Second, how can I count the number of documents Nutch crawls per day?

Answer 1:

One crawl cycle consists of four steps: Generate, Fetch, Parse and Update DB. For detailed information, read my answer here.

A limited number of fetched URLs can be caused by the following factors:

Number of Crawl cycles:

If you execute only one crawl cycle, you will get few results, because only the URLs injected (seeded) into the crawldb are fetched initially. On subsequent crawl cycles, your crawldb is updated with new URLs extracted from previously fetched pages.

topN value:

As mentioned here and here, the topN value limits the number of URLs Nutch fetches in each cycle. If you have a small topN value, you will get fewer pages.

generate.max.count:

The generate.max.count property in your Nutch configuration file (nutch-default.xml or nutch-site.xml) limits the number of URLs fetched from a single host or domain (which one depends on generate.count.mode), as stated here.
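To illustrate the first two factors, here is a minimal sketch of running several crawl cycles with an explicit topN. The function name run_crawl_cycles and the NUTCH variable are made up for this example. The argument forms (-topN, -all) are the ones I recall from the Nutch 2.x command usage, so treat them as an assumption and check them against the usage text that bin/nutch prints for your version.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): repeat the generate, fetch,
# parse and updatedb steps so that URLs discovered in one cycle can be
# fetched in the next one.
#   $1: number of crawl cycles to run
#   $2: topN, the maximum number of URLs to generate per cycle
# NUTCH is an illustrative variable pointing at the nutch launcher.
run_crawl_cycles() {
    rounds=$1
    top_n=$2
    nutch=${NUTCH:-bin/nutch}
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Assumed Nutch 2.x argument forms; verify for your release.
        "$nutch" generate -topN "$top_n" || return 1
        "$nutch" fetch -all || return 1
        "$nutch" parse -all || return 1
        "$nutch" updatedb -all || return 1
        i=$((i + 1))
    done
}
```

For example, run_crawl_cycles 5 1000 would run five cycles with up to 1000 URLs generated in each. Stopping at the first failed step keeps a broken cycle from silently repeating.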

As for your second question, how to count the number of pages crawled per day: you can read the log files and tally from them the number of pages crawled each day.

In Nutch 1.x, the log file is written to the logs folder: NUTCH_HOME/logs/hadoop.log

You can count the log lines that match a given date and the status "fetching" like this:

grep -ic '2016-05-26.*fetching' logs/hadoop.log
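If you want the totals for every day at once rather than one date at a time, something along these lines might work. It is only a sketch: the function name count_fetched_per_day is made up for this example, and it assumes that each log line begins with a YYYY-MM-DD date as its first whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print "date count" for every
# day that has log lines containing "fetching" (case-insensitive).
#   $1: path to the log file, defaults to logs/hadoop.log
# Assumption: the first field of each log line is a YYYY-MM-DD date.
count_fetched_per_day() {
    awk '
        tolower($0) ~ /fetching/ &&
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { n[$1]++ }
        END { for (d in n) print d, n[d] }
    ' "${1:-logs/hadoop.log}" | sort
}
```

Calling count_fetched_per_day logs/hadoop.log prints one line per day, sorted by date. Note that this counts log lines, so a URL that is fetched again on a later day is counted again.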