I am trying to get word frequencies using Solr. When I run this query:
localSolr/solr/select?q=someQuery&rows=0&facet=true&facet.field=content&wt=xml
Solr gives me frequencies like this:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="content">
<int name="word1">24</int>
<int name="word2">12</int>
<int name="word3">8</int>
But when I count the words myself, I find that word2 actually occurs 13 times. Solr counts repeated occurrences of a word within the same field only once.
For example, suppose the field text is:
word2 word5 word7 word9 word2
Solr returns a count of 1 for word2 here, not 2. It also returns 1 for word2 in both of the sentences below:
word2 word10 word11 word12
word2 word9 word7 word2 word23
So the frequencies come back wrong. I have looked through the facet parameters but didn't find one that changes this. How can I make Solr count repeated occurrences of a word within a sentence?
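The difference described above can be reproduced outside Solr. The sketch below is only an illustration in plain Python, not Solr code: `document_counts` and `occurrence_counts` are hypothetical helper names, and the two strings are the example sentences from this question. It counts each word once per text, which matches the numbers the facet returns, and then counts every occurrence, which is the number I am after.

```python
from collections import Counter

# The two example sentences from the question, treated as two field values.
texts = [
    "word2 word10 word11 word12",
    "word2 word9 word7 word2 word23",
]

def document_counts(values):
    """Count each word at most once per text (how the facet numbers behave)."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split()))  # set() drops repeats within one text
    return counts

def occurrence_counts(values):
    """Count every occurrence of each word across all texts."""
    counts = Counter()
    for value in values:
        counts.update(value.split())
    return counts

print(document_counts(texts)["word2"])    # 2: one per text
print(occurrence_counts(texts)["word2"])  # 3: 1 in the first text + 2 in the second
```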
Edit: the relevant part of schema.xml:
<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
<field name="content" type="text_tr" stored="true" indexed="true" multiValued="true"/>
<copyField source="content" dest="text"/>
<field name="text" type="text_tr" stored="false" indexed="true" multiValued="true"/>
Use the Luke request handler:
http://localhost:8983/solr/admin/luke?fl=YOUR_TEXT_FIELD&numTerms=500
More info: http://wiki.apache.org/solr/LukeRequestHandler
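As a rough sketch, the Luke URL above can be built from Python so that the field name and term limit are easy to change. The helper name `luke_url` is made up for this example, and the base URL and the field name `content` are assumptions taken from the question; adjust them to your setup.

```python
from urllib.parse import urlencode

def luke_url(field, num_terms=500, base="http://localhost:8983/solr"):
    """Build a Luke request handler URL for one field.

    `base` is an assumed Solr location; `fl` and `numTerms` are the
    same parameters used in the URL above.
    """
    params = {"fl": field, "numTerms": num_terms}
    return f"{base}/admin/luke?{urlencode(params)}"

print(luke_url("content"))
# http://localhost:8983/solr/admin/luke?fl=content&numTerms=500
```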
If the field you're faceting on is multivalued, then each word in the facet gets the proper count.
I forgot to mention one thing: the Term Vector Component will get you what you need.
In the query, tv.tf returns the term frequency for each term, while tv.fl tells Solr which fields the frequency should be calculated for.
NB: this makes indexing slower than it is now, so you will have to try it and see.
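A minimal sketch of such a query, built in Python. Several things here are assumptions rather than facts about your setup: the helper name `term_vector_url` is made up, the handler path `/tvrh` is the one used in Solr's example configuration and may differ or be missing in yours, and the field needs term vectors enabled in the schema (termVectors="true") and the data reindexed before any frequencies come back.

```python
from urllib.parse import urlencode

def term_vector_url(query, field, rows=10,
                    base="http://localhost:8983/solr", handler="/tvrh"):
    """Build a Term Vector Component request URL.

    Assumptions: `base` and `handler` match your solrconfig.xml, and
    `field` was indexed with term vectors enabled.
    """
    params = {
        "q": query,
        "rows": rows,
        "tv": "true",     # switch the component on for this request
        "tv.tf": "true",  # return the term frequency of each term
        "tv.fl": field,   # field(s) to calculate frequencies for
    }
    return f"{base}{handler}?{urlencode(params)}"

print(term_vector_url("someQuery", "content"))
# http://localhost:8983/solr/tvrh?q=someQuery&rows=10&tv=true&tv.tf=true&tv.fl=content
```

The frequencies come back per document, so if you want one total per word you would still have to add them up on the client side.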