I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt:
+^https://www.mywebsite.com/abc-def/(.+)*$
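For comparison, here is a minimal sketch of how I understand such a filter could also be written, with the dots escaped so they match literally and an explicit reject-everything-else rule (mywebsite.com is of course a placeholder, and I'm not certain whether either change matters for my problem):

```
# accept only URLs under the abc-def path (dots escaped to match literal dots)
+^https?://www\.mywebsite\.com/abc-def/
# reject everything else
-.
```

As I understand it, the rules are applied top to bottom and the first match wins; a URL that matches no rule is rejected anyway, so the trailing -. just makes that explicit.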
When I run the following crawl command:
**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**
It crawls and indexes just the one URL from seed.txt, and in the 2nd iteration it just says:
Generator: starting at 2017-02-28 09:51:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
When I change regex-urlfilter.txt to allow everything (+.), it starts indexing every URL on https://www.mywebsite.com, which I certainly don't want.
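For anyone trying to reproduce this: the way I understand the filter rules can be checked against individual URLs is with Nutch's URLFilterChecker tool, something like the sketch below (I may have the exact flags wrong, and the sample URL is just an illustration):

```
# reads URLs from stdin and prints +URL if accepted by the combined filters, -URL if rejected
echo "https://www.mywebsite.com/abc-def/some-page" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```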
If anyone has run into the same problem, please share how you got past it.