I'm using the termfreq function in Solr 4.x, as in the following example, to find "autozero amplifiers" in the field contents.
http://localhost:8080/solr/select/?fl=contents,documentPageId,termfreq%28contents,%27autozero%20amplifiers%27%29&defType=func&q=termfreq%28contents,%27autozero%20amplifiers%27%29&fq=documentId%3A49667
I am getting a frequency of zero for the following paragraph, which contains the phrase "autozero amplifiers".
What do I have to change in either solrconfig.xml or schema.xml in order to use termfreq on a phrase, not just on a single word such as "amplifier"?
Unless you let Lucene consider "autozero amplifiers" as one term, you can't use term vectors to get what you are looking for. You could use KeywordTokenizerFactory for indexing, which doesn't actually tokenize the words; it preserves the entire stream of text as one token. But if, for instance, the field you are interested in contains the following text,
"The quick brown fox jumps over the lazy dog"
how do you define your term boundaries?
The quick
The quick brown
quick brown
quick brown fox jumps
over the lazy dog
.....
The number of combinations grows quickly for a single field value: for contiguous phrases alone it grows roughly with the square of the number of words.
Since I have been answering some of your questions related to term vectors leading up to this one, my guess is that you are trying to bend Solr/Lucene to count words or sets of words in a large document. You could consider integrating Solr with Hadoop and letting Hadoop do all the counting for you. Heck, every Hadoop example talks about word count and line count. Solr + Hadoop = Big Data Love. Or perhaps you could do it in your own app layer.
I don't have much info on your application's data volume, requirement goals, etc., so this is a suggestion at best.
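To make the term-boundary problem and the "own app layer" option concrete, here is a minimal Python sketch. It is not Solr code. The tokenizer is a deliberately simplistic stand-in for a real analyzer, the function names are hypothetical, and the sample paragraph is made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplistic analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_phrase(text, phrase):
    """Count how often `phrase` occurs as a run of consecutive tokens in `text`."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

def contiguous_phrases(text):
    """List every contiguous run of words; each one is a candidate 'term'."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]

# Made-up sample text, only for illustration.
sample = "Autozero amplifiers reduce offset. Most autozero amplifiers use chopping."
phrase_count = count_phrase(sample, "autozero amplifiers")

# A 9-word sentence already yields 9 * 10 / 2 = 45 contiguous phrases.
candidates = contiguous_phrases("The quick brown fox jumps over the lazy dog")

print(phrase_count)
print(len(candidates))
```

Counting in the application avoids indexing every candidate phrase as its own term, at the cost of fetching the stored text and analyzing it outside Solr.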
You may try the following trick: call termfreq() on each of the two words individually and use sum() to add up the counts. Keep in mind that this counts each word on its own, not occurrences of the exact phrase.
Further, you may use if() to check the values.
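As a rough sketch of this trick, the function query could be assembled as below. The field name, words, filter, and base URL are taken from the question. The helper names are hypothetical, and the sketch assumes the words are indexed as typed (no stemming). The nested if() yields the sum only when both words occur in the document, and 0 otherwise.

```python
from urllib.parse import urlencode

def termfreq(field, term):
    """Return a Solr termfreq() function-query expression for a single term."""
    return f"termfreq({field},'{term}')"

def summed(field, words):
    """sum() over the per-word termfreq() expressions."""
    return "sum(" + ",".join(termfreq(field, w) for w in words) + ")"

def guarded(field, words, value):
    """Wrap `value` in nested if() calls so it is 0 unless every word occurs."""
    expr = value
    for w in reversed(words):
        expr = f"if({termfreq(field, w)},{expr},0)"
    return expr

def build_url(base_url, field, expr, document_id):
    """Build the select URL with the expression as both query and returned field."""
    params = {
        "defType": "func",
        "q": expr,
        "fl": f"{field},documentPageId,{expr}",
        "fq": f"documentId:{document_id}",
    }
    return base_url + "?" + urlencode(params)

words = ["autozero", "amplifiers"]
total = summed("contents", words)
checked = guarded("contents", words, total)
url = build_url("http://localhost:8080/solr/select/", "contents", checked, 49667)

print(total)
print(checked)
print(url)
```

The sum is only an upper-level indicator: a document that mentions both words far apart gets the same score as one that contains the phrase itself.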
Hope this works for your requirement.