Facing issue in Elasticsearch mapping of Nutch crawled data

Published 2019-05-23 04:21

Question:

I am facing some serious issues while using Nutch and Elasticsearch for crawling.

We have two data storage engines in our App.

  1. MySql

  2. Elasticsearch

Let's say I have 10 URLs stored in the urls table of the MySQL database. At runtime I want to fetch these URLs from the table and write them into seed.txt for crawling; I write all of them into seed.txt in one go. The crawl then runs, and I index the resulting documents in Elasticsearch (let's say in a url index). However, I want to maintain a reference inside the Elasticsearch index so that I can fetch a particular URL's crawled details for analytics, since the Elasticsearch index only contains crawled data. For example:

My table structure in mysql is :

Table Url:

id | url
---+----------------
1  | www.google.com

The Elasticsearch index mapping I want is:

Index url:

{
  "_id": "www.google.com",
  "type": "doc",
  "content": "Hello world",
  "url_id": 1,
  .
  .
  .
}

Here url_id holds the value of the id column for the crawled URL in the urls table.
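With url_id stored on every crawled document, the per-URL analytics lookup becomes a simple term query. This is only a sketch of the query body (the url_id field and the url index name come from the question above); it builds the request body without a live Elasticsearch client:

```python
import json

def analytics_query(url_id):
    # Term query matching all crawled documents that carry
    # the given url_id (the MySQL row id proposed above).
    return {"query": {"term": {"url_id": url_id}}}

# A real call would be something like
#   es.search(index="url", body=analytics_query(1))
# using an Elasticsearch client; here we only print the body.
print(json.dumps(analytics_query(1)))
```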

I could create a separate index for each URL, but that solution is not ideal because at the end of the day I would have multiple indices. So how can I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.
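The first step described above, fetching the URLs from MySQL at runtime and writing them into seed.txt, can be sketched as follows. The rows variable stands in for the result of querying the urls table (a real implementation would use a MySQL client library); the table and column names are the ones from the question:

```python
# Sketch: write (id, url) rows from the MySQL `urls` table into seed.txt.
# `rows` stands in for the result of e.g. `SELECT id, url FROM urls`.
rows = [(1, "http://www.google.com/")]

with open("seed.txt", "w") as seed:
    for url_id, url in rows:
        # One URL per line, which is what Nutch's injector expects.
        seed.write(url + "\n")
```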

Answer 1:

You should pass the url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins so that the key/value pair gets passed to the outlinks (if necessary) or is at least available at indexing time.

See the Nutch wiki for an explanation of how to index metatags.
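A sketch of what this could look like, assuming Nutch 1.x's tab-separated seed metadata syntax and the plugin names above (the exact property names vary between Nutch versions, so verify them against your conf/nutch-default.xml). The seed line carries the MySQL row id:

```
http://www.google.com/	url_id=1
```

and nutch-site.xml would enable the plugins and declare the tag, along these lines:

```xml
<!-- Sketch only: property names are assumptions to verify
     against nutch-default.xml for your Nutch version. -->
<property>
  <name>plugin.includes</name>
  <value>...|urlmeta|index-metadata|...</value>
</property>
<property>
  <!-- Tags the urlmeta plugin propagates to outlinks. -->
  <name>urlmeta.tags</name>
  <value>url_id</value>
</property>
<property>
  <!-- Metadata keys the index-metadata plugin adds as index fields. -->
  <name>index.parse.md</name>
  <value>url_id</value>
</property>
```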