How to deduplicate documents while indexing into Elasticsearch

Published 2019-02-06 21:15

I'm using Logstash 1.4.1 together with Elasticsearch 1.0.1 and would like to replace already indexed documents based on a calculated checksum. I'm currently using the "fingerprint" filter in Logstash, which creates a "fingerprint" field based on a specified algorithm. What I want to accomplish is for Elasticsearch to replace an already existing document whenever a new document with an identical fingerprint value is indexed.

Say, for example, that I have a document with a fingerprint-field value of "2c9a6802e10fbcff36177e0b88993f90868fa6fa". If another document with an identical fingerprint value is about to be indexed, I want it to replace the old document already present in the index.
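For reference, my filter section looks roughly like this (the source field and key below are just placeholders, not my exact values):

filter {
  fingerprint {
    source => "message"        # event field(s) to hash - placeholder name
    target => "fingerprint"    # field where the computed hash is stored
    method => "SHA1"
    key    => "some-static-key"
  }
}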

I've tried to add the following to the "elasticsearch-template.json" template file, which I assume is used by the Logstash elasticsearch output plugin:

...
  "mappings" : {
    "_default_" : {
       "_id" : {"index": "not_analyzed", "store" : false, "path" : "fingerprint" },
       "_all" : {"enabled" : true},
       "dynamic_templates" : [ {
...

but it doesn't work. What am I doing wrong here?

Cheers

3 Answers
[This account has been banned]
#2 · 2019-02-06 21:42

Assuming the fingerprint is getting set as the _id, you may be hitting an issue with Logstash's daily index management because you aren't using the timestamp from your data.

Make sure the timestamp is set from the input data, so that each document is guaranteed to go to the correct daily index:

http://logstash.net/docs/1.4.2/filters/date
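For example, a date filter along these lines (the field name and format are placeholders for whatever your events actually contain):

filter {
  date {
    match  => [ "log_timestamp", "ISO8601" ]  # parse the event's own timestamp field
    target => "@timestamp"                    # the default target, shown here for clarity
  }
}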

If my guess is correct, you should see that your duplicate documents have different @timestamp values and end up in different daily indices.

乱世女痞
#3 · 2019-02-06 21:55

I would use the document_id parameter in your Logstash elasticsearch output section:

document_id

Value type is string
Default value is nil

The document ID for the index. Useful for overwriting existing entries in Elasticsearch with the same ID.

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-document_id

I believe the entry should be something like this:

document_id => "%{fingerprint}"

This uses Logstash's sprintf format to substitute the contents of a field into the string:

https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html#sprintf
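Putting it together, the output section would look something like the sketch below (host and index are placeholders; Logstash 1.4.x used the host option, while newer releases use hosts):

output {
  elasticsearch {
    host        => "localhost"
    index       => "logstash-%{+YYYY.MM.dd}"
    document_id => "%{fingerprint}"   # reuse the fingerprint as the document _id
  }
}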

女痞
#4 · 2019-02-06 21:56

You can set the document_id to the value computed by the fingerprint filter, which places the fingerprint value into the _id field of the document that is written to your index. Since _id must be unique within any given index, a new document written with an _id that already exists will overwrite the previous one, which deduplicates your data.

The following blog posts give examples of how this can be accomplished:

https://www.elastic.co/blog/logstash-lessons-handling-duplicates
https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
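A minimal sketch along the lines of those posts (field names, key, and hosts are placeholders; the [@metadata] trick requires Logstash 1.5 or later and keeps the hash out of the indexed document):

filter {
  fingerprint {
    source => "message"                     # event field(s) to hash - placeholder
    target => "[@metadata][fingerprint]"    # visible to the pipeline only, not indexed
    method => "SHA1"
    key    => "dedup-key"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "logstash-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}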

Disclaimer: I am a Consulting Engineer at Elastic.
