Nutch does not Index on Elasticsearch correctly us

Posted 2020-07-27 05:37

Question:

I am running Nutch 2.3.1, Mongodb 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial:

https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch

and this tutorial:

http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

My goal is to build a web crawling tool from those three pieces of software.

Everything works great until it comes down to indexing. As soon as I run the index command from Nutch:

# bin/nutch index elasticsearch -all

this happens:

IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9300)
        elastic.index : elastic index command
        elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

IndexingJob: done.

My nutch-site.xml:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>AOssama Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try HBase, as much as I dislike it.

Thanks!

Joe

Answer 1:

After a lot of trouble I got it working. I ended up using Elasticsearch 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.

These are the issues I ran into, many of which remain unanswered in a number of other threads:

  • (this is an easy one but...) MAKE SURE EVERYTHING IS RUNNING. Make sure elasticsearch is running on the correct machine with the correct port. Make sure you can talk to it. Make sure MongoDB is up and running on the correct port, make sure you can talk to it.
  • Use the correct index command. For Nutch 2.3.1 it's: ./bin/nutch index -all (after you fetch and parse). If you run into a Solr error, you do not have the correct indexer plugin in your nutch-site.xml.
  • Name your cluster the SAME THING in your elasticsearch.yml (cluster.name) and your nutch-site.xml (elastic.cluster). This was huge. This is the main reason I had any error thrown in my index function.
  • Versioning. I tried to do this with the newer versions of Elasticsearch and frequently ran into problems. I am going to attempt to build this on the newest version of Elasticsearch and Mongo and get back to this thread. Try to use the same build I did first, then attempt the other builds. Elasticsearch versioning with nutch seems to be the most important part because of the dependencies regarding gora in the ivy/ivy.xml settings as well as the indexer-elastic/plugin.xml settings.
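
The first and third points above can be checked from a shell. The sketch below assumes that "the same thing" refers to the cluster name, that is, the elastic.cluster property in nutch-site.xml and the cluster_name that Elasticsearch reports on its HTTP endpoint. The function names, localhost, port 9200, and the conf/nutch-site.xml path are all assumptions for illustration, not details from this thread.

```shell
#!/bin/sh
# Sketch only: host, port and file paths are assumptions, not taken from the thread.

# Print the <value> that follows <name>elastic.cluster</name> in a Nutch config file.
nutch_cluster_name() {
  grep -A1 '<name>elastic.cluster</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Read the JSON returned by Elasticsearch's root HTTP endpoint on stdin
# and print its cluster_name field.
es_cluster_name() {
  sed -n 's/.*"cluster_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Compare the two names; returns non-zero if they differ or are empty.
same_cluster_name() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "cluster names match: $1"
  else
    echo "cluster name mismatch: nutch='$1' elasticsearch='$2'" >&2
    return 1
  fi
}

# Example use against a live node (assumed to listen on localhost:9200):
#   same_cluster_name "$(nutch_cluster_name conf/nutch-site.xml)" \
#                     "$(curl -s http://localhost:9200/ | es_cluster_name)"
```

If the curl call returns nothing, Elasticsearch is probably not reachable on that host and port, which covers the "make sure everything is running" point as well. A comparable check for MongoDB, assuming the mongo shell is installed, is `mongo --eval 'db.runCommand({ ping: 1 })'`.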

Please, please, please, let me know if you're having any trouble with this. It took me close to 2 full weeks to figure this build out and I know it can be incredibly frustrating. PM me or post on this if you're running into issues, I'm sure I can help you work through them.

Joe



Answer 2:

The Nutch 2.x branch supports both Elasticsearch 2.2.0 and MongoDB (through the Gora plugin). For the Mongo backend, enable the following dependency in $NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
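
Before rebuilding, it can help to confirm that the dependency line is actually present in ivy.xml. A minimal sketch follows; show_dependency is a made-up helper name, and whether the matching line sits inside an XML comment still has to be checked by eye at the reported line number.

```shell
# Print every line of an Ivy file that mentions the given artifact name,
# prefixed with its line number. (show_dependency is a hypothetical helper.)
show_dependency() {
  grep -n "name=\"$2\"" "$1"
}

# Example, assuming NUTCH_HOME points at the Nutch source tree:
#   show_dependency "$NUTCH_HOME/ivy/ivy.xml" gora-mongodb
```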

In addition to this, there is information on how to upgrade Elasticsearch in $NUTCH_HOME/src/plugin/indexer-elastic2/howto_upgrade_es.txt:

  1. Upgrade elasticsearch dependency in $NUTCH_HOME/src/plugin/indexer-elastic2/ivy.xml

  2. Upgrade the Elasticsearch-specific dependencies in src/plugin/indexer-elastic2/plugin.xml. To get the list of dependencies and their versions, execute:

$ ant -f ./build-ivy.xml
$ ls lib | sed 's/^/      <library name="/g' | sed 's/$/"\/>/g'
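
To see what the sed pipeline above produces, the same two substitutions can be run on a couple of made-up file names instead of the real lib directory. The jar names below are placeholders, not actual Elasticsearch dependencies, and to_library_entries is a name chosen only for this illustration.

```shell
# Wrap each input line in a <library name="..."/> element,
# using the same two sed substitutions as the command above.
to_library_entries() {
  sed 's/^/      <library name="/g' | sed 's/$/"\/>/g'
}

# Placeholder names standing in for the output of `ls lib`.
printf '%s\n' 'example-a-1.0.jar' 'example-b-2.0.jar' | to_library_entries
# prints:
#       <library name="example-a-1.0.jar"/>
#       <library name="example-b-2.0.jar"/>
```

The generated lines are in the form that plugin.xml expects for its library entries, so they can be used for step 2.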