Remove results below a certain score threshold in

Posted 2019-01-11 07:29

Question:

Is there built-in functionality in Solr/Lucene to filter out results that fall below a certain score threshold? For example, if I provide a score threshold of 0.2, all documents with a score below 0.2 should be removed from my results. My intuition is that this is possible by customizing Solr or Lucene.

Could you point me in the right direction on how to do this?

Thanks in advance!

Answer 1:

You could write your own Collector that skips any document the scorer places below your threshold. Below is a simple example of this using Lucene.Net 2.9.1.2 and C#. It stores only document ids, so you'll need to modify it if you want to keep the calculated score.

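using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Collects the ids of all documents whose score is at or above a lower bound.
public class ScoreLimitingCollector : Collector {
    private readonly Single _lowerInclusiveScore;
    private readonly List<Int32> _docIds = new List<Int32>();
    private Scorer _scorer;
    private Int32 _docBase;

    // Index-wide ids of the documents that passed the threshold.
    public IEnumerable<Int32> DocumentIds {
        get { return _docIds; }
    }

    public ScoreLimitingCollector(Single lowerInclusiveScore) {
        _lowerInclusiveScore = lowerInclusiveScore;
    }

    // Called before collection starts on a segment; keep the scorer for Collect.
    public override void SetScorer(Scorer scorer) {
        _scorer = scorer;
    }

    // doc is relative to the current segment, so add _docBase to get the index-wide id.
    public override void Collect(Int32 doc) {
        var score = _scorer.Score();
        if (_lowerInclusiveScore <= score)
            _docIds.Add(_docBase + doc);
    }

    // Called when the search moves to a new segment; remember its id offset.
    public override void SetNextReader(IndexReader reader, Int32 docBase) {
        _docBase = docBase;
    }

    // Collection order does not matter because every hit is tested independently.
    public override bool AcceptsDocsOutOfOrder() {
        return true;
    }
}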
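
On the note about keeping the calculated score: one way to do it is to store a (document id, score) pair instead of a bare id. The following is a library-free Java sketch of that variation, not Lucene or Lucene.Net code. ScoredHit, ScoreKeepingCollector, setDocBase and collect are hypothetical names invented for illustration, and the score is passed in directly rather than read from a Scorer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types; these are NOT part of Lucene.
final class ScoredHit {
    final int docId;    // index-wide id (segment base + segment-local id)
    final float score;  // score as computed at collection time

    ScoredHit(int docId, float score) {
        this.docId = docId;
        this.score = score;
    }
}

final class ScoreKeepingCollector {
    private final float lowerInclusiveScore;
    private final List<ScoredHit> hits = new ArrayList<>();
    private int docBase;

    ScoreKeepingCollector(float lowerInclusiveScore) {
        this.lowerInclusiveScore = lowerInclusiveScore;
    }

    // Plays the role of SetNextReader: remember the current segment's id offset.
    void setDocBase(int docBase) {
        this.docBase = docBase;
    }

    // Plays the role of Collect: keep the hit only if it meets the lower bound.
    void collect(int doc, float score) {
        if (score >= lowerInclusiveScore) {
            hits.add(new ScoredHit(docBase + doc, score));
        }
    }

    // Read-only view of the kept hits, in collection order.
    List<ScoredHit> hits() {
        return Collections.unmodifiableList(hits);
    }
}
```

In a real collector the same change would amount to replacing the list of ids with a list of such pairs inside Collect.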


Answer 2:

This is known as a normalized score (scores as percentages).

You can use the following parameters to achieve that:

ns = {!func}product(scale(product(query({!type=edismax v=$q}),1),0,1),100)
fq = {!frange l=20}$ns

Where 20 is your 20% threshold.
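
To make the arithmetic behind these two parameters concrete, here is a library-free Java sketch. It assumes that scale(x,0,1) is a min-max rescale of the scores it is given and that the frange lower bound is inclusive. PercentScoreFilter and its methods are hypothetical names, and returning 0 when all scores are equal is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the math only; it does not call Solr.
final class PercentScoreFilter {

    // Min-max rescale raw scores to the range 0..100.
    static double[] toPercent(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Assumption: when every score is identical, map everything to 0.
            out[i] = range == 0.0 ? 0.0 : (scores[i] - min) / range * 100.0;
        }
        return out;
    }

    // Return the positions whose percentage is at or above lowerPercent.
    static List<Integer> keepAtOrAbove(double[] scores, double lowerPercent) {
        double[] pct = toPercent(scores);
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < pct.length; i++) {
            if (pct[i] >= lowerPercent) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

With raw scores 0.5, 2.5, 1.0 and 4.5, the percentages are 0, 50, 12.5 and 100, so a lower bound of 20 keeps the second and fourth documents. Note that under min-max scaling the lowest-scoring document always maps to 0 and is dropped by any positive threshold.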

Related: how do I normalise a solr/lucene score?


I would not recommend doing this because absolute score values in Lucene are not meaningful (e.g., scores are not directly comparable across searches). The ratio of a score to the highest score returned is meaningful, but there is no absolute calibration for the highest score returned, at least at present, so there is not a way to determine from the scores what the quality of the result set is overall. There are various approaches to improving this that have been discussed (making the scores more directly comparable by encoding additional information into the score and using that for normalization, or probably better, generalizing the score to an object that contains multiple pieces of information; e.g. the total number of query terms matched by the top result if you are using default OR would be quite useful). None of these ideas are implemented yet as far as I know. - @Chuck

Source: RE: Limiting Hits with a score threshold

Related: Re: A question about scoring function in Lucene
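
The quote above singles out the ratio of a score to the highest score returned as the meaningful quantity. A minimal Java sketch of filtering on that ratio, applied to scores already fetched from a single search, could look like the following. RelativeScoreFilter and keepByRatioToTop are hypothetical names, not Lucene or Solr APIs, and treating a non-positive top score as "keep nothing" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; works on plain score arrays from one result set.
final class RelativeScoreFilter {

    // Return the positions whose score is at least minRatio times the top score.
    static List<Integer> keepByRatioToTop(double[] scores, double minRatio) {
        List<Integer> kept = new ArrayList<>();
        double top = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            top = Math.max(top, s);
        }
        // Assumption: with no results or a non-positive top score, keep nothing.
        if (scores.length == 0 || top <= 0.0) {
            return kept;
        }
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] / top >= minRatio) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Because the cutoff is relative to the best hit of the same search, it does not depend on comparing absolute scores across searches, but as the quote points out it still says nothing about whether the best hit itself is any good.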



Answer 3:

Just an update for anyone who stumbles on this: Lucene now provides an EarlyTerminatingSortCollector, so a custom collector no longer needs to be written for this. Wrap it around a TopDocsCollector (in the OP's specific case, a TopScoreDocCollector) to achieve the given task.

EarlyTerminatingSortCollector

A Collector that early terminates collection of documents on a per-segment basis, if the segment was sorted according to the given Sort.

TopDocsCollector

A base class for all collectors that return a TopDocs output. This collector allows easy extension by providing a single constructor which accepts a PriorityQueue as well as protected members for that priority queue and a counter of the number of total hits.



Tags: lucene solr