Nutch crawl no error , but result is nothing

2019-04-14 07:06发布

问题:

I try to crawl some urls with nutch 2.1 as follows.

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

http://wiki.apache.org/nutch/NutchTutorial

There is no error , but undermentioned folders don't be made.

crawl/crawldb
crawl/linkdb
crawl/segments

Can anyone help me? I have not resolved this trouble for two days. Thanks a lot!

output is as follows.

FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=9
-finishing thread FetcherThread0, activeThreads=8
-finishing thread FetcherThread1, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all

runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>My Nutch Spider</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>
</configuration>

runtime/local/conf/regex-urlfilter.txt

# accept anything else
+.

runtime/local/urls/seed.txt

http://nutch.apache.org/

回答1:

As you are using Nutch 2.X, you need to follow the relevant tutorial. The one that you gave was for Nutch 1.x. Nutch 2.X uses external storage backends like HBase, Cassandra so the crawldb, segments etc directories wont be formed.

Also, use bin/crawl script instead of the bin/nutch command.