I am running Nutch 2.3.1, Mongodb 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial:
https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch
and this tutorial:
http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
In order to create a web crawling tool using those aforementioned 3 pieces of software.
Everything works great until it comes down to indexing... as soon as I use the index command from nutch:
# bin/nutch index elasticsearch -all
this happens:
IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
IndexingJob: done.
My nutch-site.xml:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>AOssama Crawler</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>elastic.host</name>
<value>localhost</value>
</property>
<property>
<name>elastic.cluster</name>
<value>aossama</value>
</property>
<property>
<name>elastic.index</name>
<value>nutch</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.content.limit</name>
<value>6553600</value>
</property>
</configuration>
I also looked into the ElasticIndexWriter.java code and noticed near line 250 the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try with Hbase as much as I dislike it.
Thanks!
Joe