If I have a document with a multivalued field in Solr, are the multiple values scored independently, or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:
I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.
Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff
If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?
You can just run your query q=field_name:David with debugQuery=on and see what happens.
These are the results (I included the score through fl=*,score), sorted by score desc:
And this is the explanation:
The scoring factors here are:
In your example the fieldNorm makes the difference. You have one document with a lower termFreq (1 instead of 1.4142135) since the term appears just one time, but that match is more important because of the field length.
The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single-valued field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
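To make the arithmetic concrete, here is a minimal Python sketch of the two factors that differ between the documents. It assumes Lucene's classic TF-IDF similarity (tf = sqrt(termFreq), fieldNorm = 1/sqrt(number of tokens in the field)) and plain whitespace tokenization. It leaves out idf and queryNorm, which are the same for every document in this query, and it ignores the precision Lucene loses when it stores the norm in a single byte, so the numbers will not match Solr's output exactly. The helper name tf_times_norm is made up for the illustration.

```python
import math

# All values of the multivalued name field, per document (from the question).
people = {
    "David Bowie": ["David Bowie", "David Robert Jones", "Ziggy Stardust", "Thin White Duke"],
    "David Letterman": ["David Letterman"],
    "David Hasselhoff": ["David Hasselhoff", "David Michael Hasselhoff"],
}


def tf_times_norm(values, term):
    """Approximate tf * fieldNorm for one document (hypothetical helper)."""
    # The values of a multivalued field are indexed into the same field,
    # so the field length is the total number of tokens across all values.
    tokens = [tok for value in values for tok in value.split()]
    freq = tokens.count(term)
    tf = math.sqrt(freq)                        # classic tf
    field_norm = 1.0 / math.sqrt(len(tokens))   # classic length norm, no boost
    return tf * field_norm


scores = {name: tf_times_norm(values, "David") for name, values in people.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
```

Under these assumptions the single-name document ranks first and the ten-token document ranks last, even though the latter contains the term twice.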
UPDATE
I actually think David Bowie deserves his opportunity. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in the schema.xml and reindex. The same query will give you the following result:
As you can see, now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two David occurrences are on top with the same score, despite their different lengths, and the shorter document with just one match is the last one, with the lowest score. Here's the explanation with debugQuery=on:
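This is not Solr's output, just a hand calculation in Python of why the order flips once norms are omitted. It assumes the classic tf = sqrt(termFreq), treats the norm as a constant 1.0 when omit_norms is set, and uses token counts based on whitespace tokenization of the names in the question. The function name score and the omit_norms flag are made up for this sketch.

```python
import math

# (total tokens in the field, occurrences of "David"), whitespace tokenization assumed
docs = {
    "David Bowie": (10, 2),
    "David Letterman": (2, 1),
    "David Hasselhoff": (5, 2),
}


def score(num_tokens, freq, omit_norms):
    """tf * norm, where the norm collapses to 1.0 when norms are omitted."""
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(num_tokens)
    return math.sqrt(freq) * norm


with_norms = {name: score(n, f, omit_norms=False) for name, (n, f) in docs.items()}
without_norms = {name: score(n, f, omit_norms=True) for name, (n, f) in docs.items()}

for name in docs:
    print(f"{name}: with norms {with_norms[name]:.4f}, without norms {without_norms[name]:.4f}")
```

Without norms, the two documents that contain the term twice tie, and field length no longer plays any role.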
You could use Lucene's SweetSpotSimilarity to define the plateau of lengths that should all have a norm of 1.0. This could help in your situation: when you are searching for things like names, lengthNorm doesn't do any good.
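As a rough illustration of the plateau idea, here is the length norm formula as I understand it from the SweetSpotSimilarity documentation, rewritten in Python: 1/sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1). The plateau bounds of 1 and 10 tokens are assumptions picked to cover the names in the question, and the function name is made up. This is not the Lucene class itself.

```python
import math


def sweet_spot_length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Length norm that stays at 1.0 for any length in [min_len, max_len]."""
    # Inside the plateau the two distances add up to exactly (max_len - min_len),
    # so the penalty term is zero and the norm is 1.0.
    penalty = abs(num_terms - min_len) + abs(num_terms - max_len) - (max_len - min_len)
    return 1.0 / math.sqrt(steepness * penalty + 1.0)


# Field lengths of the three documents in the question (2, 5 and 10 tokens)
# plus one longer field that falls outside the assumed plateau.
for length in (2, 5, 10, 12):
    print(length, sweet_spot_length_norm(length, min_len=1, max_len=10))
```

With a plateau that covers all realistic name lengths, the three documents get the same norm, so only termFreq separates them, while unusually long fields are still penalized.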