OK, after pulling my hair out all day trying to figure this one out, I decided to get some input from the community.
I should mention that I'm fairly new to Elasticsearch.
The idea is that I have an ES index containing some documents, and I need to index a new document only if no existing document with similar (but not necessarily equal) field content is already indexed.
I can perform a match query on multiple fields and get a global score for the query, but since that score is not a percentage of the maximum possible score, I'm not sure how to set a threshold to decide whether I can insert the document or not.
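For reference, the query I'm running has roughly the shape below. This is only a sketch: `build_similarity_query` is a made-up helper name, and the choice of `title` and `text` as the fields to compare is an assumption. It only builds the request body as a Python dict, it does not talk to a cluster.

```python
def build_similarity_query(doc, fields=("title", "text")):
    """Build a bool query with one match clause per compared field.

    Hypothetical helper: the name and the default field list are
    assumptions made for illustration.
    """
    return {
        "query": {
            "bool": {
                # Each matching clause adds to the document's _score.
                "should": [
                    {"match": {field: doc[field]}}
                    for field in fields
                    if field in doc
                ]
            }
        }
    }


candidate = {
    "title": "My first blog entries",
    "text": "Just trying it out...",
    "date": "2014/01/01",
}
body = build_similarity_query(candidate)
```

The `_score` of each hit for a body like this is an unbounded relevance value, not a ratio, which is exactly why I don't know what cutoff to pick.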
I am obviously a bit confused about the ES scoring system. Thanks in advance for all the help I can get on this.
EDIT:
Here is a basic example.
This is already indexed:
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
This is new but should not be indexed, since its fields are not equal to the existing ones but are too similar:
{
"title": "My first blog entries",
"text": "Just trying it out...",
"date": "2014/01/01"
}
This is new and should be indexed:
{
"title": "My second entry for this blog",
"text": "I am just trying out a few things",
"date": "2014/01/01"
}
So it's basically deduplication prior to indexing, based on field similarity, that I am after :)
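To make the kind of threshold I have in mind concrete, here is a rough sketch in plain Python. It does not use Elasticsearch scoring at all: it relies on `difflib.SequenceMatcher` from the standard library, which returns a ratio between 0 and 1. The helper names, the choice to compare only `title` and `text`, and the 0.8 cutoff are all arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a 0..1 similarity ratio between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(new_doc, existing_doc, fields=("title", "text"), threshold=0.8):
    """Treat new_doc as a duplicate if the mean per-field ratio reaches threshold.

    The field list and the 0.8 threshold are assumptions, not tuned values.
    """
    scores = [field_similarity(new_doc[f], existing_doc[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


existing = {"title": "My first blog entry", "text": "Just trying this out..."}
too_similar = {"title": "My first blog entries", "text": "Just trying it out..."}
different = {
    "title": "My second entry for this blog",
    "text": "I am just trying out a few things",
}

skip_second = is_duplicate(too_similar, existing)  # expected to be flagged
skip_third = is_duplicate(different, existing)     # expected to pass through
```

What I'm looking for is a way to get this sort of bounded, comparable similarity out of an Elasticsearch query, so that the check can run against the whole index rather than one document at a time.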