I am facing a serious issue while using Nutch and Elasticsearch for crawling.
We have two data storage engines in our app:
MySQL
Elasticsearch
Let's say I have 10 urls stored in the urls table of the MySQL db. At run time I want to fetch these urls from the table and write them into seed.txt for crawling; I write all of these urls into seed.txt in one go (a sketch of this step follows the example below). My crawl then runs, and I index the crawled docs into an Elasticsearch index (let's say a url index). But I want to maintain a reference inside the Elasticsearch index so that I can fetch a particular url's crawled details for analytics purposes, as the Elasticsearch index only contains crawled data. For example:
My table structure in MySQL is:
Table urls:
id    url
1     www.google.com
The Elasticsearch document I want is:
Index url:
{ "_id": "www.google.com", "type": "doc", "content": "Hello world", "url_id": 1, ... }
Here url_id holds the value of the id column for the crawled url in the urls table.
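As a minimal sketch of the seed-writing step mentioned above, assuming a hypothetical JDBC connection string and credentials, with the MySQL Connector/J driver on the classpath (table and column names are taken from the example):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SeedWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, url FROM urls");
             PrintWriter out = new PrintWriter("urls/seed.txt")) {
            while (rs.next()) {
                // One url per line, the plain format Nutch's injector reads.
                out.println(rs.getString("url"));
            }
        }
    }
}

This only writes the urls themselves; the answer below shows how to extend each seed line with the url_id.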
I could create a separate index for each url, but that solution is not ideal because I would end up with multiple indices at the end of the day. So how do I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.
You should pass the url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair gets passed to the outlinks (if necessary) or is at least available at indexing time.
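For example, Nutch's injector accepts tab-separated key=value pairs after each url in the seed file, so the seed line for the table row above would be (note that the separator must be a tab, and Nutch expects a full url with a protocol):

http://www.google.com/	url_id=1

At injection time the url_id is stored in that url's CrawlDatum metadata, where the indexing plugins can pick it up.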
See the Nutch wiki for an explanation of how to index metatags.
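As a rough sketch of the nutch-site.xml configuration, assuming the standard Nutch 1.x property names (verify them against your 1.12 install; the plugin.includes value is abridged to the relevant entries):

<property>
  <name>plugin.includes</name>
  <value>...|urlmeta|index-(basic|metadata)|indexer-elastic|...</value>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>url_id</value>
</property>
<property>
  <name>index.db.md</name>
  <value>url_id</value>
</property>

Here urlmeta.tags lists the seed metadata keys that the urlmeta plugin propagates to outlinks, and index.db.md tells index-metadata to index the url_id it finds in the CrawlDb metadata, so every document sent to Elasticsearch carries a url_id field pointing back to the MySQL row.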