Solr score keyword detection rate

Posted 2019-08-28 00:20

Question:

I'm using Solr 6.1.

I'm tuning the scoring now, but I have an issue with the scores.

I search for GCS, with qf set to: title^100 content^70 text^50.

All three fields are of type text_general.

The first result's score is 1050.8486 and another's is 853.08655, but the first one's content field is very short while the other one's is very long. I don't understand why the first score is so much higher.

The debugQuery output for the two results is below:

1002.8741 = sum of:
  1002.8741 = max of:
    1002.8741 = weight(title:GCS in 1275) [], result of:
      1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        1.177973 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          4.0 = fieldLength
    928.3479 = weight(content:GCS in 1275) [], result of:
      928.3479 = score(doc=1275,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        1.8474623 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          16.0 = fieldLength


811.1335 = sum of:
  811.1335 = max of:
    127.21202 = weight(text:GCS in 9400) [], result of:
      127.21202 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        7.464645 = idf(docFreq=78, docCount=137000)
        0.3408388 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          44.69738 = avgFieldLength
          256.0 = fieldLength
    811.1335 = weight(title:GCS in 9400) [], result of:
      811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        0.9527551 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          7.111111 = fieldLength
    174.06395 = weight(content:GCS in 9400) [], result of:
      174.06395 = score(doc=9400,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        0.34639663 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          7281.778 = fieldLength

===========================================================================

I have another question: when I use shards, omitNorms doesn't seem to work. Why? I found that short content scores higher than long content, even though the schema is the same.

The first result is from collection A and has short content; the other is from collection B and has long content:

1158.9161 = sum of:
  1158.9161 = max of:
    1158.9161 = weight(title:boeing in 52601) [], result of:
      1158.9161 = score(doc=52601,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        11.589161 = idf(docFreq=5, docCount=593568)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1085.6042 = weight(content:boeing in 52601) [], result of:
      1085.6042 = score(doc=52601,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        11.279006 = idf(docFreq=7, docCount=593568)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)


1060.8777 = sum of:
  1060.8777 = max of:
    433.1234 = weight(text:boeing in 39406) [], result of:
      433.1234 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        8.662468 = idf(docFreq=112, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    884.746 = weight(title:boeing in 39406) [], result of:
      884.746 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.84746 = idf(docFreq=93, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1060.8777 = weight(content:boeing in 39406) [], result of:
      1060.8777 = score(doc=39406,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        8.069756 = idf(docFreq=203, docCount=650450)
        1.8780489 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
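
For reference, the tfNorm values in both explains do match BM25 with b = 0, where field length drops out entirely; a quick check, assuming Lucene's BM25 tfNorm formula:

# With b = 0, tfNorm = tf*(k1+1) / (tf + k1): field length no longer matters.
def tf_norm_no_norms(tf, k1=1.2):
    return tf * (k1 + 1) / (tf + k1)

print(tf_norm_no_norms(2.0))  # 1.375      (content:boeing, first result)
print(tf_norm_no_norms(7.0))  # ~1.8780489 (content:boeing, second result)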

Answer 1:

The underlying similarity Solr 6.1 uses is BM25 [1].
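
For reference, each per-field score in the explain output below is the product boost · idf · tfNorm; in Lucene 6.x's BM25Similarity those two statistics are computed roughly as:

idf = \ln\Bigl(1 + \frac{docCount - docFreq + 0.5}{docFreq + 0.5}\Bigr)

tfNorm = \frac{tf \cdot (k_1 + 1)}{tf + k_1 \bigl(1 - b + b \cdot fieldLength / avgFieldLength\bigr)}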

This means the field value's length compared to the average field length is important. More specifically, you are using dismax, which takes into consideration purely the maximum-scoring field per document. So, exploring the maximums:

First document Max:

1002.8741 = weight(title:GCS in 1275) [], result of:
  1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    1.177973 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      4.0 = fieldLength

Second document Max:

811.1335 = weight(title:GCS in 9400) [], result of:
  811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    0.9527551 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      7.111111 = fieldLength
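
Plugging the explain values into that formula reproduces both title scores exactly. A minimal Python sketch (the function name is mine; the constants are copied from the explain output above):

import math

def bm25_field_score(boost, doc_freq, doc_count, tf, field_len, avg_field_len,
                     k1=1.2, b=0.75):
    # idf and tfNorm as in Lucene 6.x BM25Similarity
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * field_len / avg_field_len))
    return boost * idf * tf_norm

# First document: title length 4.0 is below the 6.3423285 average -> tfNorm > 1
print(bm25_field_score(100.0, 27, 137000, 1.0, 4.0, 6.3423285))       # ~1002.87
# Second document: title length 7.111111 is above the average -> tfNorm < 1
print(bm25_field_score(100.0, 27, 137000, 1.0, 7.111111, 6.3423285))  # ~811.13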

So the shorter title of the first document makes it the winner. You can play with dismax/edismax's tie parameter to take the other fields into consideration as well, not only the maximum [2].
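
For example, a hypothetical request (tie=0.3 is just an illustrative value; the range is 0.0 to 1.0):

q=GCS&defType=edismax&qf=title^100 content^70 text^50&tie=0.3&debugQuery=true

With a non-zero tie, each document's score becomes (max field score) + tie × (sum of the other fields' scores), so the second document's content and text matches would count as well.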

Regards

[1] http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

[2] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter



Tags: solr