In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position. The order of the tokens is not important. How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
"field":"red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token. So now in the field, there are 2 positions:
- Position 1: "red"
- Position 2: "car", "automobile"
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
"field":"red car",
"max_position": 2
}
I have a custom similarity that extends from DefaultSimilarity and returns 1 tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.
Query:
{
"custom_score": {
"query": {
"match": {
"field": "a car is an automobile"
}
},
"_script": "_score*100/doc[\"max_position\"]+_score"
},
"min_score":"100"
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it matches, because Elasticsearch counts the matches for car and automobile as two matches and that gives a score of 2 which leads to a script score of 102, which satisfies the "min_score".
If you needed to guarantee 100% matches against the query terms you could use
minimum_should_match
. This is the more common case.Unfortunately, in your case, you wish to provide 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom (java - here's boilerplate you can fork) Similarity class, because you need access to low-level index information that is not exposed to the Query DSL:
Per document/field scanned in the query scorer:
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.