I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial:
https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch
and this tutorial:
http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
My goal is to build a web crawling tool from these three pieces of software.
Everything works great until it comes to indexing. As soon as I run the index command from Nutch:
# bin/nutch index elasticsearch -all
this happens:
IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
IndexingJob: done.
My nutch-site.xml:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>AOssama Crawler</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>elastic.host</name>
<value>localhost</value>
</property>
<property>
<name>elastic.cluster</name>
<value>aossama</value>
</property>
<property>
<name>elastic.index</name>
<value>nutch</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.content.limit</name>
<value>6553600</value>
</property>
</configuration>
I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try HBase instead, as much as I dislike it.
Thanks!
Joe
Nutch supports both Elasticsearch 2.2.0 and MongoDB (via the Gora plugin) in the branch named 2.x. For the Mongo backend, you need to enable it in $NUTCH_HOME/ivy/ivy.xml.
In addition, there is information on how to upgrade Elasticsearch in $NUTCH_HOME/src/plugin/indexer-elastic2/howto_upgrade_es.txt
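As a rough sketch of the ivy.xml change mentioned above: in the Nutch 2.x source tree the Gora MongoDB dependency is normally present but commented out, and enabling it means uncommenting it before rebuilding. The revision number below is an assumption based on the Gora version that Nutch 2.3.1 uses, so keep whatever `rev` your own ivy.xml already lists.

```xml
<!-- Inside the <dependencies> element of $NUTCH_HOME/ivy/ivy.xml -->
<!-- Assumption: rev 0.6.1 matches the Gora version of your checkout; keep the value your file already has -->
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
```

After changing ivy.xml, Nutch has to be rebuilt from source (for example with `ant runtime`) so that the MongoDB store ends up on the classpath.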
After a lot of trouble I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.
Many of the issues I went through remained unanswered in a number of other threads. The index command to run is:
./bin/nutch index -all
(after you fetch and parse). If you run into a Solr error, you do not have the correct index function in your nutch-site.xml.
Please, please, please, let me know if you're having any trouble with this. It took me close to two full weeks to figure this build out, and I know it can be incredibly frustrating. PM me or post on this thread if you're running into issues; I'm sure I can help you work through them.
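For the Solr error mentioned above, here is a minimal sketch of the nutch-site.xml entry that is usually involved. It assumes the error comes from the stock plugin.includes value, which lists indexer-solr; overriding it so that indexer-elastic appears instead should make ElasticIndexWriter the active index writer. The plugin list here is only illustrative, so adapt it to the plugins you actually need.

```xml
<!-- Goes inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative list (an assumption, not a required set): the key point is indexer-elastic and no indexer-solr -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The elastic.host, elastic.cluster, and elastic.index properties from the question stay as they are; this entry only controls which indexer plugin gets loaded.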
Joe