Elasticsearch: match every position only once

In my Elasticsearch index I have documents that have multiple tokens at the same position.

I want to get a document back when I match at least one token at every position. The order of the tokens is not important. How can I accomplish that? I use Elasticsearch 0.90.5.

Example:

I index a document like this.

{
    "field":"red car"
}

I use a synonym token filter that adds synonyms at the same positions as the original token. So now in the field, there are 2 positions:

Position 1: "red"
Position 2: "car", "automobile"

My solution for now:

To be able to ensure that all positions match, I index the maximum position as well.

{
    "field":"red car",
    "max_position": 2
}

I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.

Query:

{
    "custom_score": {
        "query": {
             "match": {
                 "field": "a car is an automobile"
             }
        },
        "_script": "_score*100/doc[\"max_position\"]+_score"
    },
    "min_score":"100"
}

Problem with my solution:

The above search should not match the document, because the query string contains no token "red". But it does match: Elasticsearch counts "car" and "automobile" as two separate matches, which gives a score of 2. The script turns that into 2*100/2+2 = 102, which satisfies the "min_score".
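To make the overcounting concrete, here is a minimal sketch in Python that models the field as one set of tokens per position. It is only an illustration of the arithmetic, not Elasticsearch code; the names doc_positions, matched_term_count, matched_position_count and script_score are made up for this example.

```python
# Hypothetical model of the indexed field "red car": one set of tokens per position.
doc_positions = [{"red"}, {"car", "automobile"}]
max_position = len(doc_positions)


def matched_term_count(query_tokens, positions):
    """Count every indexed term found in the query (what the current score amounts to)."""
    query = set(query_tokens)
    return sum(len(tokens & query) for tokens in positions)


def matched_position_count(query_tokens, positions):
    """Count each position at most once, however many of its tokens match."""
    query = set(query_tokens)
    return sum(1 for tokens in positions if tokens & query)


def script_score(score, max_pos):
    """Same arithmetic as the script: _score*100/max_position + _score."""
    return score * 100 / max_pos + score


query = "a car is an automobile".split()

term_score = matched_term_count(query, doc_positions)
position_score = matched_position_count(query, doc_positions)

# Counting terms passes the min_score of 100 although "red" never matched;
# counting positions would stay below it.
print(script_score(term_score, max_position))
print(script_score(position_score, max_position))
```

What I actually need is the second count: every position should contribute at most once.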

Tags: lucene position elasticsearch

1 Answer

男人必须洒脱

#2 · 2019-04-28 07:40

If you needed to guarantee 100% matches against the query terms you could use minimum_should_match. This is the more common case.
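For reference, a request body for that more common case might look like the sketch below. It is built as a Python dict so it can be serialized to JSON; the helper name match_all_query_terms is made up, and it assumes the match query in your version accepts minimum_should_match as a percentage.

```python
import json


def match_all_query_terms(field, text):
    """Build a search body requiring every analyzed query term to match.

    Assumption: the match query in your Elasticsearch version supports
    "minimum_should_match" with a percentage value.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "minimum_should_match": "100%",
                }
            }
        }
    }


body = match_all_query_terms("field", "red car")
print(json.dumps(body, indent=4))
```

Note that this constrains the query side only: every query term must be found in the document, but nothing forces every indexed position to be covered.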

Unfortunately, in your case you want 100% of the indexed terms to match. To do this, you'll have to drop down to the Lucene level and write a custom Similarity class in Java (here's boilerplate you can fork), because you need access to low-level index information that is not exposed to the Query DSL:

Per document/field scanned in the query scorer:

Number of analyzed terms matched ("overlap" in Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
Number of total analyzed terms in the field. Look at this thread for a couple of different ways to get this information: "How to count the number of terms for each document in lucene index?"

Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.

Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.
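The rule itself can be sketched outside Lucene. The Python below is only a model of the logic, not a Similarity implementation: SYNONYMS, analyze and score are hypothetical names, and the synonym table stands in for whatever your synonym token filter produces at both index and query time.

```python
# Hypothetical synonym table; in Elasticsearch this comes from the synonym token filter.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def analyze(text):
    """Lowercase, split on whitespace, and expand each token with its synonyms."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms |= SYNONYMS.get(token, set())
    return terms


def score(query_text, field_text):
    """Return the overlap count, or zero if any indexed term went unmatched."""
    indexed_terms = analyze(field_text)
    query_terms = analyze(query_text)
    overlap = len(indexed_terms & query_terms)
    raw_score = float(overlap)
    # Terms matched < total terms: multiply the score by zero.
    if overlap < len(indexed_terms):
        return raw_score * 0
    return raw_score


print(score("a car is an automobile", "red car"))  # "red" is never matched
print(score("automobile red", "red car"))          # every indexed term is covered
```

Because both sides are expanded the same way, a query that says "automobile" still covers the indexed "car", while a query that never mentions "red" is zeroed out.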
