termfreq for a phrase

Posted 2019-05-21 22:00

Question:

I'm using the Solr 4.x termfreq function in the following request to find "autozero amplifiers" in the field CONTENTS.

http://localhost:8080/solr/select/?fl=contents,documentPageId,termfreq%28contents,%27autozero%20amplifiers%27%29&defType=func&q=termfreq%28contents,%27autozero%20amplifiers%27%29&fq=documentId%3A49667

I get a frequency of zero for the following paragraph, which contains the phrase "autozero amplifiers".

What do I have to change in solrconfig.xml or schema.xml in order to use termfreq on a phrase, not just on one word such as "amplifier"?

Answer 1:

Unless you make Lucene treat "autozero amplifiers" as one term, you can't use term vectors to get what you are looking for. You could index with KeywordTokenizerFactory, which doesn't actually tokenize the words: it preserves the entire stream of text as one token. But if, for instance, the field you are interested in contains the following text,

 "The quick brown fox jumps over the lazy dog"

how do you define your term boundaries?

 The quick
 The quick brown
 quick brown
 quick brown fox jumps
 over the lazy dog
 .....

The number of combinations grows quickly even for a single field value: a text of n words already has n(n+1)/2 contiguous word sequences. Since I have been answering some of your questions related to term vectors leading up to this one, my guess is that you are trying to bend Solr/Lucene into counting words or sets of words in a large document. You could consider integrating Solr with Hadoop and let Hadoop do all the counting for you. Heck, every Hadoop example talks about word count and line count. Solr + Hadoop = Big Data Love. Or perhaps you could do it in your own app layer.
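To get a feel for how many candidate terms there are, here is a small Python sketch that enumerates every contiguous word sequence of the example sentence. It only splits on whitespace, which is an assumption for illustration and not what a Solr analyzer does. The helper name `contiguous_phrases` is made up.

```python
def contiguous_phrases(text):
    """Return every contiguous run of words in text, single words included."""
    words = text.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]


sentence = "The quick brown fox jumps over the lazy dog"
phrases = contiguous_phrases(sentence)
n = len(sentence.split())

# For n words there are n * (n + 1) / 2 contiguous sequences.
print(n, len(phrases), n * (n + 1) // 2)
```

Nine words already give 45 candidate terms, and a real document field would have far more, which is why indexing every possible phrase as its own term is not practical.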

I don't have much information on your application's data volume, requirements, goals, etc., so this is a suggestion at best.
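If you go the app-layer route, counting a phrase yourself could look roughly like the sketch below. It assumes you have already fetched the stored field value from Solr as a plain string. The tokenizer here (lowercase, split on anything that is not a letter or digit) is my own simplification and will not match your schema's analyzer exactly. The function names and the sample text are made up.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_count(text, phrase):
    """Count how often the words of phrase appear next to each other, in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return 0
    size = len(target)
    return sum(
        1
        for i in range(len(words) - size + 1)
        if words[i:i + size] == target
    )


# Made-up field value, only for illustration.
sample = (
    "Autozero amplifiers are listed here. "
    "The autozero amplifiers section follows an amplifiers overview."
)
print(phrase_count(sample, "autozero amplifiers"))
print(phrase_count(sample, "amplifiers"))
```

For the sample string this counts the phrase twice and the single word three times, so the two numbers really are different things.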



Answer 2:

You may try the following trick:

  1. Call termfreq() on each of the two words individually and add the results with sum() to get a combined count.

  2. Further, you may use if() to check the values.

Hope this works for your requirement.
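The two steps above could be put together roughly like this. It is a Python sketch that only builds the request URL, reusing the host, field names and filter from the question. The helper names are made up, and nesting one if() per word is just one way to do the check. Two caveats: as far as I know termfreq() looks its argument up as an indexed term without analyzing it, so the words have to be given in their indexed form (lowercased, stemmed, whatever your analyzer produces), and the sum counts each word on its own, so it is not a true phrase count.

```python
from urllib.parse import urlencode


def termfreq(field, term):
    """Build a termfreq() function call for one indexed term."""
    return f"termfreq({field},'{term}')"


def summed_termfreq(field, terms):
    """Step 1: sum() of the per-word term frequencies."""
    return "sum(" + ",".join(termfreq(field, t) for t in terms) + ")"


def if_all_present(field, terms, then_expr, else_expr="0"):
    """Step 2: nested if() calls that yield then_expr only when every word occurs."""
    expr = then_expr
    for term in reversed(terms):
        expr = f"if({termfreq(field, term)},{expr},{else_expr})"
    return expr


def build_url(base, field, terms, fq):
    """Assemble the select URL with a function query as q."""
    total = summed_termfreq(field, terms)
    params = {
        "defType": "func",
        "q": if_all_present(field, terms, total),
        "fl": f"{field},documentPageId,{total}",
        "fq": fq,
    }
    return base + "?" + urlencode(params)


url = build_url(
    "http://localhost:8080/solr/select/",
    "contents",
    ["autozero", "amplifiers"],
    "documentId:49667",
)
print(url)
```

The resulting q is zero unless both words occur in the field, and otherwise it is the sum of the two frequencies.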



Tags: solr