Building a tag cloud with Solr

Posted 2019-03-16 16:17

Question:

Dear Stack Overflow community:

Given some text, I wish to get the TOP 50 most frequent words in the text, and create a tag cloud out of it, and thus show the gist of what the text is about in a graphical way.

The text is actually a set of 100 or so comments for each item (a picture), and there are about 120 items. I also want to keep the cloud up to date by keeping the comments indexed and running the cloud generation code each time a new web request comes in.

I settled on using Solr to index the text, and I am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after you turn on term frequency with tv.tf=true:

  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>    
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>

  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>

As you can see, I have two problems:

  1. I get all the terms within the document for that field, not just the top 100.
  2. They are not sorted by frequency, so I have to fetch the terms and sort them in memory to do what I'm trying to do.

Is there a better way? Can I tell Solr's term vector component to somehow sort the terms and pick only the top 100 for me? Or is there some other framework I can use? I need to keep new comments indexed as they come in, so the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and makes it into a nice image.
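
For reference, the in-memory approach from point 2 could look roughly like the Python sketch below. It assumes the response has exactly the shape shown above, wrapped in a hypothetical <response> root element so that it parses as one XML document. The function name top_terms is made up for illustration and is not part of Solr.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample copied from the response above, wrapped in a hypothetical <response>
# root element so that it parses as a single XML document.
SAMPLE = """<response>
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>
</response>"""


def top_terms(xml_text, field, limit=50):
    """Sum the tf values per term across all documents, largest first."""
    totals = Counter()
    root = ET.fromstring(xml_text)
    # Each document is a direct child <lst>; the field is a <lst> inside it.
    for doc in root.findall("lst"):
        for field_node in doc.findall("lst"):
            if field_node.get("name") != field:
                continue
            for term_node in field_node.findall("lst"):
                tf = term_node.findtext("tf")
                if tf is not None:
                    totals[term_node.get("name")] += int(tf)
    return totals.most_common(limit)


# A dictionary of weighted words, the kind of input the cloud generator expects.
weights = dict(top_terms(SAMPLE, "includes"))
print(weights)
```

This still pulls every term over the wire before sorting, which is the cost the question is trying to avoid.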

This answer does not help.

EDIT - trying out jpountz's and Paige Cook's answers

Here is the result I got for this query:

    select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50

<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>

I got 50 such elements. @jpountz, thanks for helping limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess is that the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid), not the frequency of the words in Post_Content.

To prove this, I removed Id:GUID from the query, and the result was:

<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>

My problem is how to get the term frequency within the document, not the document frequency of many terms. For example, I know for a fact that bootable was a word I used 6 times in Post_Content, so I want sorted pairs like (6, "bootable"), (5, "disc") for a set of documents.
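
To make the distinction concrete, here is a small Python sketch that computes both numbers outside of Solr. The two sample strings and the function name are invented for illustration, and plain whitespace splitting stands in for whatever analyzer the real field uses.

```python
from collections import Counter


def term_and_doc_frequency(docs):
    """Return (term frequency, document frequency) counters for a list of texts."""
    tf = Counter()  # total number of occurrences of each word
    df = Counter()  # number of documents that contain each word
    for text in docs:
        words = text.lower().split()
        tf.update(words)       # every occurrence counts
        df.update(set(words))  # each word counts once per document
    return tf, df


# Hypothetical sample data: two short "posts".
docs = ["bootable disc bootable", "disc image"]
tf, df = term_and_doc_frequency(docs)
print(tf["bootable"], df["bootable"])
```

Faceting reports the second kind of number, while a tag cloud weighted by how often a word is used needs the first.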

Answer 1:

Here is an article that describes setting up a Tag Cloud - Creating a Tag Cloud with Solr and PHP. While the PHP portion may not be applicable to you, the actual generation of the tag cloud I believe is...

This article describes a method of creating a text field with a whitespace tokenizer to return individual words, and then performing a facet search against this field. I know that you can set facet limits, so in your case you can request only the top 100 results.



Answer 2:

If each Lucene document is a comment, you could use faceting for this. For example, the following request http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.minCount=1&facet.limit=50 would help you build a tag cloud for comments MA147LL/A and 3007WFP.

However, this approach would:

  • make Solr instantiate an UnInvertedField instance for the includes field, which requires memory,
  • count the number of documents which match a term instead of the total number of occurrences of this term.
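
A minimal Python sketch of issuing such a facet request from client code follows. The base URL is a placeholder and both function names are hypothetical. It uses the lowercase facet.mincount spelling and assumes Solr's default JSON layout, in which facet values arrive as a flat [term, count, term, count, ...] list.

```python
from urllib.parse import urlencode


def facet_url(base_url, query, field, limit=50):
    """Build a select URL that facets on `field` (base_url is a placeholder)."""
    params = {
        "q": query,
        "rows": 0,            # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": limit,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def pair_up(flat):
    """Turn [term, count, term, count, ...] into [(term, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


print(facet_url("http://localhost:8983/solr", "uniqueKey:(MA147LL/A OR 3007WFP)", "includes"))
print(pair_up(["content", 33, "can", 17]))
```

As the second bullet says, the counts that come back are numbers of matching documents, not total occurrences.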


Answer 3:

I have come up with a STOPGAP solution (I'm calling each Solr document a "post" for example's sake):

There is a terms component in Solr, whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement features like auto-complete, and other features that operate at a term level. And it is by default sorted by frequency - the more frequently occurring terms in the field come up first.

What I have done is create a dynamic field called content_ and index each post-set in its own field based on category. This means that there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on that field to get the TOP TERMS for that post-set.

As a picture:

content_postSetOne : contains indexed version of a set of posts
content_postSetTwo : contains indexed version of another set of posts
content_postSetThree : contains indexed version of a third set of posts
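
Querying one of these fields through the terms component might look like the Python sketch below. It assumes a request handler registered at /terms, a placeholder base URL, and the flat [term, count, ...] JSON layout. The function names and the sample response are made up for illustration.

```python
from urllib.parse import urlencode


def terms_url(base_url, post_set, limit=50):
    """Build a terms-component URL for the dynamic field of one post-set."""
    params = {
        "terms": "true",
        "terms.fl": f"content_{post_set}",  # dynamic field, e.g. content_postSetOne
        "terms.limit": limit,
        "terms.sort": "count",              # most frequent terms first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def to_weights(response, field):
    """Convert the flat [term, count, ...] list for `field` into a dict."""
    flat = response["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical response body, already decoded from JSON.
fake = {"terms": {"content_postSetOne": ["boot", 6, "disc", 5]}}
print(terms_url("http://localhost:8983/solr", "postSetOne"))
print(to_weights(fake, "content_postSetOne"))
```

The resulting dictionary can be handed to the cloud generator as its weighted words.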

This solution is sort of working for me, and you can easily create a field per post as well if needed. I'm also interested in knowing the implications of using dynamic fields like this: will this be a problem?

How this differs from Paige's and jpountz's answers:

  1. The term frequency is the count of a word's occurrences in one document or in a set of documents, not the number of documents containing the term.
  2. I can get the top occurring terms from within ONE document and, if needed, from a set of documents.
  3. I did not use faceting because it primarily gives the frequency as the number of documents containing the word, not the number of times the word occurred regardless of document.