How do I change the scoring function of Solr to give less weight to "term frequency"?
I am using a pagerank-like document boost as a relevancy factor. My search index currently puts many documents that are "spammy" or not well-cleaned up and have repetitive words on top.
I know the score is calculated by term frequency (how often a search term is in the document), inverse document frequency, and others (How are documents scored?). I could just increase the boost, but that would de-emphasize the other factors, too.
Is the way to go to specify a function at query time (and if so, what is the default function), or do I have to change the configuration and reindex? I am using django-haystack with Solr, if it makes a difference.
I'm not sure this is the best way to do it, but this seems to work. I created a subclass of Similarity in Java. In ClassicSimilarity, term frequency is defined as sqrt(freq). It doesn't make sense to add a multiplicative factor, since tf is multiplied with the other terms, not added, so the scale factor would just be applied uniformly to every score. I.e. scale * a * b doesn't make sense, while scale * a + b would. But what you can do in this case is a^scale * b. This basically applies the scale factor in the logarithm: log(score) = scale * log(a) + log(b).
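To make the argument about the exponent concrete, here is a small standalone sketch that does not use Lucene at all. The class name TfExponentDemo, the two example documents and the single "other" factor (standing in for the document boost, IDF and so on) are assumptions made up for illustration. A real Lucene score has more components, so this only shows the shape of the effect.

```java
// Standalone illustration; not part of Lucene or Solr.
class TfExponentDemo {

    // Term-frequency component: freq^exponent (ClassicSimilarity uses 0.5).
    static double tf(double freq, double exponent) {
        return Math.pow(freq, exponent);
    }

    // Simplified score: the tf component times one combined "other" factor.
    static double score(double freq, double exponent, double other) {
        return tf(freq, exponent) * other;
    }

    public static void main(String[] args) {
        // Hypothetical documents: a repetitive one with a low boost,
        // and a clean one with a higher boost.
        double spamFreq = 16, spamBoost = 1.0;
        double cleanFreq = 1, cleanBoost = 3.0;

        for (double exponent : new double[] {0.5, 0.25}) {
            double spam = score(spamFreq, exponent, spamBoost);
            double clean = score(cleanFreq, exponent, cleanBoost);
            System.out.println("exponent " + exponent
                    + ": spam=" + spam + ", clean=" + clean);
        }

        // A constant multiplier scales both scores equally,
        // so it cannot change which document ranks first.
        double scale = 0.5;
        System.out.println("scaled: spam="
                + scale * score(spamFreq, 0.5, spamBoost)
                + ", clean=" + scale * score(cleanFreq, 0.5, cleanBoost));
    }
}
```

With these made-up numbers, the repetitive document ranks first at exponent 0.5 and second at exponent 0.25, while multiplying both scores by a constant leaves the order unchanged.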
Also note that the default similarity function in this Solr version doesn't seem to be TF-IDF after all, but BM25. The class here is a variation of TF-IDF, since it extends ClassicSimilarity.
package com.example.solr;

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class CustomSimilarity extends ClassicSimilarity {

    // Dampen term frequency more strongly than the default sqrt(freq).
    @Override
    public float tf(float freq) {
        return (float) Math.pow(freq, 0.25); // default: 0.5
    }

    // Identifies this similarity when it is printed.
    @Override
    public String toString() {
        return "CustomSimilarity";
    }
}
Compile it with:
javac -cp /path/to/solr-6.6.1/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.1.jar:. -d . CustomSimilarity.java
jar -cvf myscorer.jar com
Then add to solrconfig.xml:
<lib path="/path/to/myscorer.jar" />
and in schema.xml:
<similarity class="com.example.solr.CustomSimilarity">
</similarity>
After restarting Solr, you can verify that the new similarity class is being used under http://localhost:8983/solr/#/<corename>/schema.