Give less weight to term frequency in Solr?

Posted 2019-08-25 22:38

Question:

How do I change the scoring function of Solr to give less weight to "term frequency"?

I am using a pagerank-like document boost as a relevancy factor. My search index currently ranks many documents at the top that are "spammy" or poorly cleaned up and contain repetitive words.

I know the score is calculated from term frequency (how often a search term occurs in the document), inverse document frequency, and other factors (How are documents scored?). I could just increase the boost, but that would de-emphasize the other factors, too.

Should I specify a function at query time (and what is the default function), or do I have to change the configuration and reindex? I am using django-haystack with Solr, if that makes a difference.

Answer 1:

I'm not sure this is the best way to do it, but it seems to work: create a subclass of Similarity in Java. In ClassicSimilarity, term frequency is defined as sqrt(freq).

Adding a multiplicative factor to tf doesn't help, because tf is multiplied with the other terms rather than added to them, so the factor would be applied uniformly to every score and leave the ranking unchanged. That is, scale * a * b doesn't make sense, whereas scale * a + b would. What you can do instead is change the exponent: a^scale * b. This applies the scale factor in the logarithm: log(score) = scale * log(a) + log(b).
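To see what changing the exponent does numerically, here is a small standalone sketch. It uses plain Java with no Lucene dependency, and the class and method names (TfDamping, tf, advantage) are made up for illustration. It compares how much a document that repeats a term 16 times gains over one that mentions it once, for the default exponent 0.5 and for 0.25.

```java
public class TfDamping {
    // Term-frequency factor with a configurable exponent.
    // exponent = 0.5 corresponds to sqrt(freq), as in ClassicSimilarity.
    public static double tf(double freq, double exponent) {
        return Math.pow(freq, exponent);
    }

    // How many times larger the tf factor of a repetitive document is
    // compared to a document with fewer occurrences of the term.
    public static double advantage(double highFreq, double lowFreq, double exponent) {
        return tf(highFreq, exponent) / tf(lowFreq, exponent);
    }

    public static void main(String[] args) {
        double[] exponents = {0.5, 0.25};
        for (double e : exponents) {
            System.out.printf("exponent=%.2f  tf(1)=%.2f  tf(16)=%.2f  advantage=%.2f%n",
                    e, tf(1, e), tf(16, e), advantage(16, 1, e));
        }
    }
}
```

With exponent 0.5 the repetitive document gets a tf factor 4 times larger; with 0.25 the gap shrinks to a factor of 2, while the other score components are left untouched.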

Also note that the default similarity in Solr doesn't seem to be TF-IDF after all, but BM25. The class below is a variation of TF-IDF.
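For comparison, here is a rough sketch of how the two models treat term frequency. It assumes the textbook BM25 term-frequency factor with length normalization switched off (b = 0) and the commonly cited default k1 = 1.2. The exact formula in a given Lucene version may differ, and the class and method names are hypothetical.

```java
public class Bm25TfSketch {
    // Textbook BM25 term-frequency factor without length normalization (b = 0).
    // It approaches the upper bound k1 + 1 as freq grows.
    public static double bm25Tf(double freq, double k1) {
        return freq * (k1 + 1) / (freq + k1);
    }

    // TF-IDF style factor as described for ClassicSimilarity: sqrt(freq), unbounded.
    public static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed default, adjust to your configuration
        double[] freqs = {1, 4, 16, 64};
        for (double f : freqs) {
            System.out.printf("freq=%.0f  classic=%.3f  bm25=%.3f%n",
                    f, classicTf(f), bm25Tf(f, k1));
        }
    }
}
```

Under these assumptions the BM25 factor saturates below k1 + 1, whereas sqrt(freq) keeps growing, so heavily repeated terms already count for less under BM25.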

package com.example.solr;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class CustomSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return (float) Math.pow(freq, 0.25); // ClassicSimilarity uses sqrt(freq), i.e. exponent 0.5
    }

    @Override
    public String toString() {
        return "CustomSimilarity";
    }
}

Compile it with:

javac -cp /path/to/solr-6.6.1/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.1.jar:. -d . CustomSimilarity.java
jar -cvf myscorer.jar com

Then, add to solrconfig.xml:

<lib path="/path/to/myscorer.jar" />

and in schema.xml:

<similarity class="com.example.solr.CustomSimilarity">
</similarity>

After restarting Solr, you can verify that the new similarity class is in use at http://localhost:8983/solr/#/<corename>/schema.