I am facing a serious issue while using Nutch and Elasticsearch for crawling.
We have two data storage engines in our app:
MySQL
Elasticsearch
Let's say I have 10 urls stored in the urls table of the MySQL db. At run time I want to fetch these urls from the table and write them into seed.txt for crawling; I write all of these urls into seed.txt in one go (a sketch of this step follows the example below). My crawl then runs, and I index the crawled docs into an Elasticsearch index (let's say a url index). But I want to maintain a reference inside the Elasticsearch index so that I can fetch a particular url's crawled details for analytics purposes, as the Elasticsearch index only contains crawled data. For example:
My table structure in MySQL is:
Table urls:
id    url
1     www.google.com
The Elasticsearch document I want is:
Index url:
{ "_id": "www.google.com", "type": "doc", "content": "Hello world", "url_id": 1, ... }
Here url_id holds the value of the id column for the crawled url in the urls table.
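As a minimal sketch of the seed-writing step mentioned above, assuming a hypothetical JDBC connection string and credentials, with the MySQL Connector/J driver on the classpath (table and column names are taken from the example):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SeedWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, url FROM urls");
             PrintWriter out = new PrintWriter("urls/seed.txt")) {
            while (rs.next()) {
                // One url per line, the plain format Nutch's injector reads.
                out.println(rs.getString("url"));
            }
        }
    }
}

This only writes the urls themselves; the answer below shows how to extend each seed line with the url_id.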
I could create a separate index for each url, but that solution is not ideal because I would end up with multiple indices at the end of the day. So how do I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.
You should pass the url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair gets passed to the outlinks (if necessary) or is at least available at indexing time.
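For example, Nutch's injector accepts tab-separated key=value pairs after each url in the seed file, so the seed line for the table row above would be (note that the separator must be a tab, and Nutch expects a full url with a protocol):

http://www.google.com/	url_id=1

At injection time the url_id is stored in that url's CrawlDatum metadata, where the indexing plugins can pick it up.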
See the Nutch wiki for an explanation of how to index metatags.
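As a rough sketch of the nutch-site.xml configuration, assuming the standard Nutch 1.x property names (verify them against your 1.12 install; the plugin.includes value is abridged to the relevant entries):

<property>
  <name>plugin.includes</name>
  <value>...|urlmeta|index-(basic|metadata)|indexer-elastic|...</value>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>url_id</value>
</property>
<property>
  <name>index.db.md</name>
  <value>url_id</value>
</property>

Here urlmeta.tags lists the seed metadata keys that the urlmeta plugin propagates to outlinks, and index.db.md tells index-metadata to index the url_id it finds in the CrawlDb metadata, so every document sent to Elasticsearch carries a url_id field pointing back to the MySQL row.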