apache nutch crawler - keeps retrieve only single

2019-09-02 04:02发布

问题:

INJECT step keeps retrieving only single URL - trying to crawl CNN. I'm with default config (below is the nutch-site) - what could that be - shouldn't it be 10 docs according to my value?

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
        <name>solr.server.url</name>
        <value>http://x.x.x.x:8983/solr/collection1</value>
  </property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-reg
ex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|m
etatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
</configuration>

回答1:

Nutch crawl consists of 4 basic steps: Generate, Fetch, Parse and Update DB. These steps are the same for both nutch 1.x and nutch 2.x. Execution and completion of all four steps make one crawl cycle.

Injector can be the very first step that adds the URL to the crawldb; as stated here and here.

To populate initial rows for the webtable you can use the InjectorJob.

Which I reckon you have already provided i.e cnn.com

generate.max.count limits the number of URLs to be fetched form the single domain as stated here.

Now what matters is how many URLs from cnn.com your crawldb has.

Option 1

You have generate.max.count = 10 and you have seeded or injected more than 10 URLs to the crawldb then on executing crawl cycle, nutch should fetch no more than 10 URLs

Option 2

If you have injected only one URL and you have performed only one crawl cycle then on first cycle you will get only one document processed because only one URL was in your crawldb. Your crawldb will be update at the end of each crawl cycle. So on execution of your second crawl cycle and third crawl cycle and so on, nutch should resolve only up to 10 URLs from a specific domain.