Is there built-in functionality in Solr/Lucene to filter out results that fall below a certain score threshold? For example, if I provide a score threshold of 0.2, all documents with a score below 0.2 would be removed from my results. My intuition is that this is possible by updating or customizing Solr or Lucene.
Could you point me in the right direction on how to do this?
Thanks in advance!
You could write your own Collector that skips any document the scorer places below your threshold. Below is a simple example using Lucene.Net 2.9.1.2 and C#. It stores only document IDs, so you'll need to modify it if you also want to keep the calculated score.
using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Collects the IDs of documents whose score is at or above a lower bound.
public class ScoreLimitingCollector : Collector {
    private readonly Single _lowerInclusiveScore;
    private readonly List<Int32> _docIds = new List<Int32>();
    private Scorer _scorer;
    private Int32 _docBase;

    // Index-wide IDs of the documents that passed the threshold.
    public IEnumerable<Int32> DocumentIds {
        get { return _docIds; }
    }

    public ScoreLimitingCollector(Single lowerInclusiveScore) {
        _lowerInclusiveScore = lowerInclusiveScore;
    }

    public override void SetScorer(Scorer scorer) {
        _scorer = scorer;
    }

    public override void Collect(Int32 doc) {
        // Score the current document and keep it only if it meets the threshold.
        var score = _scorer.Score();
        if (_lowerInclusiveScore <= score)
            _docIds.Add(_docBase + doc); // doc is segment-relative; add the base
    }

    public override void SetNextReader(IndexReader reader, Int32 docBase) {
        // Remember the offset of the current segment's document IDs.
        _docBase = docBase;
    }

    public override bool AcceptsDocsOutOfOrder() {
        // Collection order does not matter because nothing is ranked here.
        return true;
    }
}
This is called a normalized score (scores as percentages).
You can use the following parameters to achieve that:
ns = {!func}product(scale(product(query({!type=edismax v=$q}),1),0,1),100)
fq = {!frange l=20}$ns
Where 20 is your 20% threshold.
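To make the arithmetic behind those two parameters concrete, here is a rough plain-Java sketch of what they compute: raw scores are min-max scaled to the range 0 to 100, and anything below the lower bound is dropped. This is only an illustration, not Solr's implementation. The class and method names are hypothetical, and mapping every score to 0 when all scores are equal is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating scale(..., 0, 1) * 100 followed by a lower-bound filter.
public class NormalizedScoreFilter {

    // Min-max scale raw scores to percentages in [0, 100].
    static double[] toPercent(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] percents = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Assumption for this sketch: a zero range maps every score to 0.
            percents[i] = (range == 0.0) ? 0.0 : (scores[i] - min) / range * 100.0;
        }
        return percents;
    }

    // Return the indexes of entries whose percentage is at or above the lower bound.
    static List<Integer> keepAtOrAbove(double[] percents, double lowerInclusive) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < percents.length; i++) {
            if (percents[i] >= lowerInclusive) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Note that the result depends on which values supply the minimum and maximum. If documents that do not match the query contribute a value of 0, the minimum is 0 and the percentage is simply the ratio to the top score.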
Related: how do I normalise a solr/lucene score?
I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches). The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall. There are
various approaches to improving this that have been discussed (making
the scores more directly comparable by encoding additional information
into the score and using that for normalization, or probably better,
generalizing the score to an object that contains multiple pieces of
information; e.g. the total number of query terms matched by the top
result if you are using default OR would be quite useful). None of
these ideas are implemented yet as far as I know. - @Chuck
Source: RE: Limiting Hits with a score threshold
Related: Re: A question about scoring function in Lucene
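Since the quoted advice says the ratio of a score to the top score is the meaningful quantity, a relative cutoff can be sketched in plain Java as below. The class name, method name, and the 0.5 ratio in the usage note are hypothetical choices for illustration. As the quote warns, a cutoff like this still says nothing about the overall quality of the result set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keep hits whose score is at least minRatio times the best score.
public class RelativeScoreCutoff {

    static List<Integer> keepByRatio(float[] scores, float minRatio) {
        List<Integer> kept = new ArrayList<>();
        if (scores.length == 0) {
            return kept;
        }
        // Find the highest score in this result set.
        float top = scores[0];
        for (float s : scores) {
            top = Math.max(top, s);
        }
        // Without a positive top score, a ratio is not meaningful.
        if (top <= 0f) {
            return kept;
        }
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= minRatio * top) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

For example, with scores 8, 4, and 1 and a ratio of 0.5, the first two hits are kept because each is at least half of the top score.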
Just an update for anyone who stumbles here: Lucene now provides an EarlyTerminatingSortCollector, so a custom collector no longer needs to be written for this. Wrap it around a TopDocsCollector (in the OP's specific case, TopScoreDocCollector) to achieve the given task. Note that, per its documentation quoted below, it only terminates early for segments that were sorted according to the given Sort.
EarlyTerminatingSortCollector
A Collector that early terminates collection of documents on a per-segment basis, if the segment was sorted according to the given Sort.
TopDocsCollector
A base class for all collectors that return a TopDocs output. This collector allows easy extension by providing a single constructor which accepts a PriorityQueue as well as protected members for that priority queue and a counter of the number of total hits.