Word frequency in Solr

Posted 2020-03-04 07:48

Question:

I am trying to get word frequencies using Solr. When I run this query:

localSolr/solr/select?q=someQuery&rows=0&facet=true&facet.field=content&wt=xml

Solr returns frequencies like this:

<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="content">
<int name="word1">24</int>
<int name="word2">12</int>
<int name="word3">8</int>

But when I count the words myself, I find that word2's actual count is 13. Solr counts repeated occurrences of a word within a field as one.

For example:

Suppose the field text is: word2 word5 word7 word9 word2. Solr returns 1 as the count for word2 instead of 2. It likewise returns a count of 1 for word2 for each of the two sentences below:

word2 word10 word11 word12
word2 word9 word7 word2 word23

So the returned frequencies are wrong. I have looked through the facet parameters but did not find one that handles this. How can I make Solr count repeated words within a sentence?
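To make the discrepancy concrete, here is a minimal sketch in plain Python (no Solr involved) that counts word2 both ways over the two example sentences. The function names `document_counts` and `occurrence_counts` are made up for illustration; the first mimics the once-per-field counting I am seeing, the second is the counting I want.

```python
from collections import Counter

# The two example sentences from above, one per field value.
docs = [
    "word2 word10 word11 word12",
    "word2 word9 word7 word2 word23",
]

def document_counts(texts):
    """Count each word at most once per text (the behaviour observed in the facet output)."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.split()))
    return counts

def occurrence_counts(texts):
    """Count every occurrence of each word (the desired behaviour)."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

print(document_counts(docs)["word2"])    # 2
print(occurrence_counts(docs)["word2"])  # 3
```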

Edit: relevant part of schema.xml:

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
    <field name="content" type="text_tr" stored="true" indexed="true" multiValued="true"/>
    <copyField source="content" dest="text"/>
    <field name="text" type="text_tr" stored="false" indexed="true" multiValued="true"/>

Answer 1:

If the field you're faceting on is multivalued, then each word in the facet gets the proper count.

I forgot to mention one thing: the Term Vector Component will get you what you need.

In the query, tv.tf gives you the term frequency for each term, while tv.fl tells Solr which fields the frequency should be calculated on.

NB: this will make indexing slower than it is now, so you will have to try it and see.
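
As a rough sketch of what such a request could look like, the snippet below only builds the URL; it does not contact Solr. The host and port are borrowed from the other answer, the handler path /solr/tvrh is an assumption (it is the name used in Solr's sample configuration and may not exist in yours), and term_vector_url is a hypothetical helper. The tv=true parameter switches the component on, and the field will most likely need termVectors="true" in schema.xml.

```python
from urllib.parse import urlencode

def term_vector_url(base_url, query, field):
    """Build a Term Vector Component request URL (hypothetical helper)."""
    params = {
        "q": query,
        "rows": 10,
        "tv": "true",     # enable the Term Vector Component
        "tv.tf": "true",  # return the term frequency of each term per document
        "tv.fl": field,   # field(s) to return term vectors for
        "wt": "xml",
    }
    return base_url + "?" + urlencode(params)

# Assumed handler path; adjust to whatever handler has the component registered.
url = term_vector_url("http://localhost:8983/solr/tvrh", "someQuery", "content")
print(url)
```

The frequencies come back per document, so to get a total for a word you would still need to sum its tf values across the returned documents.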



Answer 2:

Use the Luke request handler:

http://localhost:8983/solr/admin/luke?fl=YOUR_TEXT_FIELD&numTerms=500

More info: http://wiki.apache.org/solr/LukeRequestHandler