OK, after pulling my hair out all day trying to figure this one out, I decided to get some input from the community.
I should mention that I'm fairly new to Elasticsearch.
The idea is that I have an ES index containing some documents, and I need to index a new document only if no existing document with similar (but not necessarily equal) field content is already indexed.
I can perform a match query on multiple fields and get a global score for the query, but since that score is not a percentage of the maximum possible score, I'm not sure how to set a threshold to decide whether I can insert the document or not.
I am obviously a bit confused about the ES scoring system. Thanks in advance for all the help I can get on this.
EDIT:
As a basic example
This is already indexed:
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
This is new but should not be indexed, since the fields are not equal but are too similar:
{
"title": "My first blog entries",
"text": "Just trying it out...",
"date": "2014/01/01"
}
This is new and should be indexed:
{
"title": "My second entry for this blog",
"text": "I am just trying out a few things",
"date": "2014/01/01"
}
So what I'm after is basically deduplication prior to indexing, based on field similarity :)
A good fit for your need is the more_like_this query. In such a query, you can provide artificial documents in the like field, and they will be matched against the documents in your index for similarity. By default all available fields are used, but you can also restrict the comparison to a limited number of fields.
Most of the time, this query is used to retrieve documents similar to one or a few documents that the user is looking at or has selected. Nonetheless, you can probably use it to analyze the score of the returned documents (if any) and decide whether to index your document or not.
Please refer to the documentation page linked above for a comprehensive list of parameters.
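As a rough illustration of the idea above, here is a minimal Python sketch that builds a more_like_this search body from a candidate document and then decides whether to index it. The helper names (build_mlt_query, should_index), the index name "blog", and the threshold value are hypothetical and chosen only for this example. The parameters min_term_freq, min_doc_freq and minimum_should_match are more_like_this options; the values used here are assumptions you would need to tune. Since scores are not normalized, the threshold has to be calibrated against your own data.

```python
from typing import Any, Dict, List


def build_mlt_query(index: str, candidate: Dict[str, Any], fields: List[str],
                    minimum_should_match: str = "80%") -> Dict[str, Any]:
    """Build a search body that looks for indexed documents similar to `candidate`.

    Hypothetical helper: only the fields listed in `fields` are sent as the
    artificial document in `like`.
    """
    doc = {f: candidate[f] for f in fields if f in candidate}
    return {
        "size": 1,  # only the best match is needed for the decision
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "doc": doc}],
                # Assumed values: the defaults tend to be too strict for
                # short texts and small indexes.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "minimum_should_match": minimum_should_match,
            }
        },
    }


def should_index(response: Dict[str, Any], score_threshold: float) -> bool:
    """Return True when no hit scores at or above `score_threshold`.

    `response` is the parsed JSON of a search response; the threshold is an
    assumption to be calibrated empirically, not a normalized percentage.
    """
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return True  # nothing similar was found
    best = max(hit["_score"] for hit in hits)
    return best < score_threshold


candidate = {
    "title": "My first blog entries",
    "text": "Just trying it out...",
    "date": "2014/01/01",
}
body = build_mlt_query("blog", candidate, ["title", "text"])
```

The resulting body would be sent to the index's _search endpoint with whatever client you use, and the parsed response passed to should_index. Raising minimum_should_match makes the query itself stricter, which can reduce how much you have to rely on the raw score.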