Word frequency in Solr

Posted 2020-03-04 07:57

I am trying to get the frequency of words using Solr. When I send this query:

localSolr/solr/select?q=someQuery&rows=0&facet=true&facet.field=content&wt=xml

Solr returns frequencies like this:

<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="content">
<int name="word1">24</int>
<int name="word2">12</int>
<int name="word3">8</int>

But when I count the words myself, I find that the actual count for word2 is 13. Solr counts repeated occurrences of a word within a field as one.

For example, suppose the field text is: word2 word5 word7 word9 word2. Solr returns 1 as the count for word2 instead of 2. It also returns 1 for word2 for each of the two sentences below:

word2 word10 word11 word12
word2 word9 word7 word2 word23

So the frequencies come back wrong. I have looked through the facet parameters but didn't find one that handles this. How can I make Solr count repeated words within a sentence?
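The discrepancy can be reproduced outside Solr. The sketch below is not Solr code; it assumes plain whitespace tokenization (the real `text_tr` analyzer may tokenize differently) and uses the two example sentences above as two documents. Facet counts are per-document counts, so a term contributes 1 for each document that contains it, which is what `document_frequency` mimics. The function names are made up for illustration.

```python
from collections import Counter

# The two example sentences from the question, treated as two documents.
docs = [
    "word2 word10 word11 word12",
    "word2 word9 word7 word2 word23",
]

def document_frequency(docs):
    """Count each term once per document (what a facet count reports)."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.split()))  # set() drops repeats within a document
    return counts

def total_term_frequency(docs):
    """Count every occurrence of each term across all documents."""
    counts = Counter()
    for text in docs:
        counts.update(text.split())
    return counts

df = document_frequency(docs)
tf = total_term_frequency(docs)
print(df["word2"], tf["word2"])  # 2 3
```

The gap between the two numbers (2 versus 3 here, 12 versus 13 in the question) is the number of extra in-document repeats.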

Edit: relevant part of schema.xml:

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
    <field name="content" type="text_tr" stored="true" indexed="true" multiValued="true"/>
    <copyField source="content" dest="text"/>
    <field name="text" type="text_tr" stored="false" indexed="true" multiValued="true"/>

2 Answers
孤傲高冷的网名
#2 · 2020-03-04 08:17

Use the Luke request handler:

http://localhost:8983/solr/admin/luke?fl=YOUR_TEXT_FIELD&numTerms=500

More info: http://wiki.apache.org/solr/LukeRequestHandler
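A minimal sketch of building that request in Python. The host and port are taken from the URL above and the field name `content` from the question's schema; `luke_url` is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

def luke_url(base, field, num_terms=500):
    """Build a Luke request handler URL for one field.

    base      -- Solr base URL (assumed, e.g. http://localhost:8983/solr)
    field     -- field whose top terms should be listed
    num_terms -- how many top terms to return (the numTerms parameter)
    """
    params = {"fl": field, "numTerms": num_terms}
    return f"{base}/admin/luke?{urlencode(params)}"

url = luke_url("http://localhost:8983/solr", "content")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.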

我命由我不由天
#3 · 2020-03-04 08:18

If the field you're faceting on is multivalued, then each word in the facet gets the proper count.

I forgot to mention one thing: the Term Vector Component will get you what you need.

In the query, tv.tf returns the term frequency for each term, while tv.fl tells Solr which fields the frequency should be calculated on.

NB: this makes indexing slower than it is now, so you'll have to try it and see.
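A rough sketch of what such a request could look like, built in Python. Several things here are assumptions: the handler path `tvrh` is the name used in Solr's example configs and may not exist in your solrconfig.xml, the `content` field would need `termVectors="true"` in schema.xml (followed by a reindex), and `per_doc_tf` is a simplified, hypothetical stand-in for the parsed response, whose real layout depends on the Solr version and response writer.

```python
from collections import Counter
from urllib.parse import urlencode

def term_vector_url(base, handler, query, field, rows=10):
    """Build a Term Vector Component request URL (handler name is an assumption)."""
    params = {
        "q": query,
        "rows": rows,
        "tv": "true",     # enable the Term Vector Component
        "tv.tf": "true",  # return the term frequency of each term per document
        "tv.fl": field,   # field(s) to return term vectors for
        "wt": "json",
    }
    return f"{base}/{handler}?{urlencode(params)}"

def sum_term_frequencies(per_doc_tf):
    """Add up per-document term frequencies into overall totals.

    per_doc_tf -- hypothetical mapping {doc_id: {term: tf}} extracted
                  from the response by your own parsing code
    """
    totals = Counter()
    for term_counts in per_doc_tf.values():
        totals.update(term_counts)  # Counter.update adds counts from a mapping
    return totals

url = term_vector_url("http://localhost:8983/solr", "tvrh", "someQuery", "content")

# Made-up per-document frequencies matching the two sentences in the question.
totals = sum_term_frequencies({
    "doc1": {"word2": 1, "word10": 1},
    "doc2": {"word2": 2, "word9": 1},
})
print(url)
print(totals["word2"])  # 3
```

Summing on the client side only covers the documents returned by the query, so `rows` has to be large enough (or the results paged) to get totals over the whole result set.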
