How to find if a document is a good match for a qu

Posted 2020-07-28 00:07

The score computed by Elasticsearch ranks the documents, but it does not tell whether they are a good match for the request. Currently, the first document may match on all fields or on just one. The only information the score provides is that this document is the best match.

Would it be possible to get a score normalized with respect to the query? For example, a score of 1 would mean the document matches the query perfectly, and a score of 0.1 would mean it matches poorly.

1 answer
ら.Afraid
#2 · 2020-07-28 00:21

In short, no, it is not possible to get a real normalized score for a query, but it is possible to get a good enough score normalization that works in many cases.

To get a score that tells whether a document is a good match for a query, you need to know what the best possible document for that query would be, and consequently the maximum score. With Elasticsearch and most (if not all) metrics, the maximum score is not bounded.

Even with a simple match query, you can technically reach an infinite score with a document that repeats the queried term an infinite number of times. Without a bound on the score, it is not possible to get a true normalized score.

But all hope is not lost. Instead of normalizing against the best possible score, you can normalize against a fake ideal document that is supposed to get the maximum score. For example, if you are querying two fields, name and occupation, with the queried terms Jane Doe and Cook, your ideal document can be:

{
    "name": "Jane Doe",
    "occupation": "Cook"
}

If the index contains a document with, for example, the name Jane Jane Doe, then the ideal document may not get the maximum score. If the queried fields are relatively short, you probably do not have to worry about term duplication. If you have fields with many terms, you may decide to repeat some frequently occurring terms in the ideal document. If the objective is only to find out whether a document is a good match, it is usually not a problem for a document to score higher than the ideal document.

The good news is that if you are using at least Elasticsearch 6.4, you do not have to index the fake document to get its score for a query. You can use the _scripts/painless/_execute endpoint to obtain the score of the ideal document.
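As a rough illustration of the idea above, here is a minimal Python sketch that assembles an ideal document from the queried terms, with an optional repetition count for fields where term duplication matters. The function name build_ideal_document and the repeats parameter are made up for this example and are not part of Elasticsearch.

```python
def build_ideal_document(field_terms, repeats=None):
    """Build a fake 'ideal' document from the queried terms.

    field_terms: maps each queried field to the text queried on it.
    repeats: optional map from field name to how many times the text
             should be repeated (hypothetical knob for long fields).
    """
    repeats = repeats or {}
    doc = {}
    for field, text in field_terms.items():
        # Repeat the queried text at least once for every field.
        count = max(1, repeats.get(field, 1))
        doc[field] = " ".join([text] * count)
    return doc


# Ideal document for the example query on name and occupation.
ideal = build_ideal_document({"name": "Jane Doe", "occupation": "Cook"})
print(ideal)

# Same fields, but repeating the name to mimic term duplication.
ideal_dup = build_ideal_document(
    {"name": "Jane Doe", "occupation": "Cook"}, repeats={"name": 2}
)
print(ideal_dup)
```

How many repetitions to use, if any, is a judgment call that depends on how long the fields in your index typically are.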

GET _scripts/painless/_execute
{
    "script": {
        "source": "_score"
    },
    "context": "score",
    "context_setup": {
        "index": <INDEX>,
        "document": <THE_IDEAL_DOCUMENT>,
        "query": <YOUR_QUERY>
    }
}
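The request above can be assembled and its result used roughly as follows. This is only a sketch in Python under several assumptions: the helper names build_execute_body and normalize_scores, the index name people, the bool query, and all numeric scores are invented for illustration. The sketch does not contact a cluster; it assumes you send the body to _scripts/painless/_execute with your own client and read the ideal score from the response yourself.

```python
import json


def build_execute_body(index, ideal_document, query):
    """Return the request body for _scripts/painless/_execute (score context)."""
    return {
        "script": {"source": "_score"},
        "context": "score",
        "context_setup": {
            "index": index,
            "document": ideal_document,
            "query": query,
        },
    }


def normalize_scores(hit_scores, ideal_score, cap=True):
    """Divide each hit score by the ideal document's score.

    With cap=True, hits scoring above the ideal document are clamped to 1.0,
    since a document may occasionally beat the ideal one.
    """
    if ideal_score <= 0:
        raise ValueError("ideal score must be positive")
    ratios = [score / ideal_score for score in hit_scores]
    return [min(r, 1.0) for r in ratios] if cap else ratios


# Hypothetical index name and query, for illustration only.
query = {
    "bool": {
        "should": [
            {"match": {"name": "Jane Doe"}},
            {"match": {"occupation": "Cook"}},
        ]
    }
}
body = build_execute_body(
    "people", {"name": "Jane Doe", "occupation": "Cook"}, query
)
print(json.dumps(body, indent=4))

# Made-up scores: 4.0 for the ideal document, three hits from a search.
print(normalize_scores([3.0, 1.0, 5.0], 4.0))  # [0.75, 0.25, 1.0]
```

Clamping at 1.0 is a design choice for the case where an indexed document outscores the ideal one; pass cap=False to keep the raw ratios instead.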

Please note that the field statistics of the fake document, such as the number of documents containing a field and the number of fields containing the queried term, are taken into account when computing the score. If you have many documents, this should not be a problem, but for a very rare field or term (say, a count below 20) you may notice a lower score for the ideal document than for a previously indexed document.
