I am trying to find the most frequent words in the text field of an indexed document using Solr 4.10. I created a PDF document from a text file and posted it to Solr using post.jar. When I query it by id, Solr returns the PDF contents shown below along with all of the document's metadata.
<arr name="text">
  <str>sample1</str>
  <str/>
  <str>application/pdf</str>
  <str>
    sample1 sample1.txt cook cook1 book1 book1 book2 nook1 nook1 nook2 nook2 two three four Page 1
  </str>
</arr>
In summary, I want to detect that cook, cook1 and book2 occur once each and that book1, nook1 and nook2 occur twice each.
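To make the expected result concrete, here is a minimal Python sketch that counts the words in the sample content above. It only lowercases and splits on whitespace, which is an assumption for illustration and not what Solr's StandardTokenizer does; the variable names are made up for this example.

```python
from collections import Counter

# Content of the last <str> value of the 'text' field shown above.
content = ("sample1 sample1.txt cook cook1 book1 book1 book2 "
           "nook1 nook1 nook2 nook2 two three four Page 1")

# Assumption: lowercasing plus a whitespace split stands in for
# Solr's analysis chain; the real tokenizer may split differently.
counts = Counter(content.lower().split())

# List the words from most to least frequent.
for word, count in counts.most_common():
    print(word, count)
```

This is the kind of per-document word/count listing I would like to get back from Solr itself.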
I used the TermVectorComponent configuration from the TermVectorComponent documentation, and my schema.xml defines the text field as:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
and solrconfig.xml contains:
<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
The field type 'text_general' is defined as:
<fieldType class="solr.TextField" name="text_general" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Finally, I query it from the browser with the following URL, which I believe requests the word counts in the 'text' field of the document with the given id:
http://localhost:8983/solr/select/?q=id:7e75017b-066d-4257-af10-b770726c7cf4&start=0&rows=100&indent=on&qt=tvrh&tv=true&tv.fl=text&f.text.tv.tf=true&tv.fl=text
It returns the full document response but no word counts. I only want to see the word counts for the 'text' field, in the same form as the response we get from faceting with rows=0, i.e. an array of word/count pairs.
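To make the request easier to reproduce, here is a small Python sketch that assembles a similar query string. Addressing the handler directly by its path (/tvrh) instead of going through /select with qt=tvrh is an assumption I have not verified on my setup, and rows=1 and fl=id are illustrative changes to keep the stored fields out of the response.

```python
from urllib.parse import urlencode

# Assumption: the handler registered as "/tvrh" is reachable at this path.
base_url = "http://localhost:8983/solr/tvrh"

params = [
    ("q", "id:7e75017b-066d-4257-af10-b770726c7cf4"),
    ("rows", 1),         # only one document is wanted
    ("fl", "id"),        # illustrative: return just the id field
    ("tv", "true"),      # enable the term vector component
    ("tv.fl", "text"),   # restrict term vectors to the 'text' field
    ("tv.tf", "true"),   # ask for per-document term frequencies
    ("indent", "on"),
]

# urlencode percent-encodes the colon in the id query.
url = base_url + "?" + urlencode(params)
print(url)
```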
Any help will be greatly appreciated.
NOTE: I want the word frequencies for the 'text' field of a single document, not for the 'text' field across all indexed documents. In other words, how can I ask Solr not to discard duplicate tokens (or duplicate stemmed tokens), so that I can find the most frequent words in a field?