Solr exact search: ignore duplicate phrase

Posted 2019-09-05 16:30

Question:

I'm using a Solr query to search for a keyword in documents. I want documents containing the exact phrase to come out on top, but if the same phrase is repeated many times in a document, it should be counted only once. At the moment, documents that contain the phrase several times rank higher because they get a higher score.

Please see the result given below. I am searching for "php developer"; two results are shown, but they have different scores.

For our needs both should have the same score. I want to ignore repeated occurrences of the phrase within a document.

Please also check the schema fields. I am searching the "job_search" field, which is a combination of "job_title", "key_skills", "key_skills_admin" and "job_detail":

        <copyField source="job_title" dest="job_search"/>
        <copyField source="key_skills" dest="job_search"/>
        <copyField source="key_skills_admin" dest="job_search"/>   
        <copyField source="job_detail" dest="job_search"/> 

        {
        "responseHeader":{
        "status":0,
        "QTime":7,
        "params":{
          "lowercaseOperators":"true",
          "mm":"2",
          "debugQuery":"true",
          "fl":"job_slno,job_title,job_detail,key_skills,key_skills_admin,display_date,score",
          "indent":"true",
          "q":"\"php developer\"",
          "stopwords":"true",
          "wt":"json",
          "defType":"edismax"}},
        "response":{"numFound":110,"start":0,"maxScore":2.518858,"docs":[
          {
            "job_slno":"243681",
            "job_title":"php developer",
            "job_detail":"sdf sdfs df",
            "key_skills":"php developer",
            "key_skills_admin":"php developer",
            "display_date":"2016-11-11T00:00:00Z",
            "score":2.518858},
          {
            "job_slno":"243340",
            "job_title":"sfsdfs",
            "job_detail":"dfsdfsdfsd",
            "key_skills":"PHP Developer",
            "key_skills_admin":"PHP Developer",
            "display_date":"2016-11-13T00:00:00Z",
            "score":2.399412},
          ]
        }

Answer 1:

As long as you don't depend on the position of the tokens (as in, you're not doing phrase boosting or something similar), you can set omitTermFreqAndPositions to true for the field.

That avoids storing any term frequency information, so the scores become identical as long as the term frequency is the only factor that differs.
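
As a rough sketch of what this could look like in schema.xml: only the field name job_search and the omitTermFreqAndPositions attribute come from this thread. The field type text_general and the remaining attributes are assumptions, so adapt them to your existing definition. Keep in mind that a quoted phrase query such as "php developer" needs position data, so this is an option only if you can do without phrase matching on the field. The documents have to be reindexed after the change.

```xml
<!-- Assumed field definition: the type and the indexed/stored/multiValued
     values are placeholders, not taken from the original schema. -->
<field name="job_search" type="text_general" indexed="true" stored="false"
       multiValued="true" omitTermFreqAndPositions="true"/>
```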



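Answer 2:

You can create your own custom Similarity class that extends DefaultSimilarity and override the tf method to suit your use case. In newer Lucene releases the same class is named ClassicSimilarity. The custom class then has to be registered with a similarity element in schema.xml.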

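import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {

        // Return a constant so that multiple occurrences of a term
        // in a document do not affect its relevancy.
        @Override
        public float tf(float freq) {
                return 1.0f;
        }
}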
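
To see why a constant tf evens out the scores, here is a small standalone sketch that needs no Lucene on the classpath. It assumes the classic TF-IDF behaviour of DefaultSimilarity, where tf is the square root of the frequency. The class name, the method names, and the idf and norm values are all hypothetical, and the score formula is deliberately simplified (real scoring has further factors, such as the query norm).

```java
// Standalone illustration; does not depend on Lucene.
// All names and numbers here are hypothetical and exist only for this sketch.
class TfComparison {

    // tf as computed by the classic TF-IDF similarity: sqrt(freq).
    static float classicTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // tf as overridden in CustomSimilarity: every match counts the same.
    static float constantTf(float freq) {
        return 1.0f;
    }

    // Simplified score: tf * idf * norm. Real Lucene scoring has more factors.
    static float score(float tf, float idf, float norm) {
        return tf * idf * norm;
    }

    public static void main(String[] args) {
        float idf = 2.0f;   // made-up value, identical for both documents
        float norm = 0.5f;  // made-up value, identical for both documents

        // Phrase occurs 3 times in one document and 2 times in the other.
        System.out.println("classic:  " + score(classicTf(3), idf, norm)
                + " vs " + score(classicTf(2), idf, norm));
        System.out.println("constant: " + score(constantTf(3), idf, norm)
                + " vs " + score(constantTf(2), idf, norm));
    }
}
```

With the constant tf, the two documents can still score differently if their field lengths differ, because the length norm is a separate factor.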


Tags: solr