I would like to extract frequently occurring phrases with Lucene. I am indexing text from TXT files, and I am losing a lot of context because phrases are not preserved, e.g. "information retrieval" is indexed as two separate words.
How can I get phrases like this? I cannot find anything useful on the internet; any advice, links, hints, and especially examples are appreciated!
EDIT: I store my documents just by title and content:
Document doc = new Document();
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));
because for what I am doing the content of the file matters most. Titles are too often not descriptive at all (e.g., I have many PDF academic papers whose titles are codes or numbers).
I desperately need to index the top occurring phrases from the text contents; only now do I see how poorly this simple "bag of words" approach works.
Well, the problem of losing the context for phrases can be solved by using PhraseQuery.
An index by default contains positional information of terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.
For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox). The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms to reconstruct the phrase in order.
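To make the slop idea concrete, here is a minimal plain-Java sketch. It is not Lucene's matching code: the class name SlopSketch and the method matchesWithinSlop are made up for illustration, and it only handles two terms that appear in order, whereas Lucene's PhraseQuery also handles longer phrases and reordered terms. In Lucene itself you would set the slop on the PhraseQuery rather than write this by hand.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of slop; NOT Lucene's actual implementation.
final class SlopSketch {

    // Returns true if `second` occurs after `first` with at most `slop`
    // other tokens in between (in-order matches only).
    static boolean matchesWithinSlop(List<String> tokens, String first,
                                     String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // j - i - 1 is the number of tokens strictly between the two terms.
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "the quick brown fox jumped over the lazy dog".split(" "));
        // "brown" sits between "quick" and "fox", so slop 0 is not enough.
        System.out.println(matchesWithinSlop(tokens, "quick", "fox", 0)); // false
        System.out.println(matchesWithinSlop(tokens, "quick", "fox", 1)); // true
    }
}
```

With slop 0 only the exact phrase "quick fox" would match; slop 1 tolerates the one intervening word, which is the case described above.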
Check out Lucene's JavaDoc for PhraseQuery.
See this example code, which demonstrates how to work with various Query objects.
You can also try to combine various query types with the help of the BooleanQuery class.
And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.
Is it possible for you to post any code that you have written?
Basically, a lot depends on the way you create your fields and store documents in Lucene.
Let's consider a case where I have two fields, ID and Comments. In my ID field I allow values like 'finding nemo', i.e. strings with spaces, whereas Comments is a free-text field, i.e. I allow anything my keyboard can produce and Lucene can understand.
Now, in a real-life scenario it does not make sense to split my ID 'finding nemo' into two separately searchable strings, whereas I do want to index every word in Comments.
So what I will do is create a Document (org.apache.lucene.document.Document) object to take care of this. Essentially, I create two fields:
Comments, indexed with Field.Index.ANALYZED
ID, indexed with Field.Index.NOT_ANALYZED
This is how you customize Lucene's behavior with the default tokenizer and analyzer. Otherwise, you can write your own tokenizers and analyzers.
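As a rough illustration of the difference between the two index modes, here is a toy plain-Java model. It is not Lucene code: FieldIndexingSketch and its methods are hypothetical names, and the lowercase-and-split step is only an assumption standing in for what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model of the two indexing modes; NOT Lucene code.
final class FieldIndexingSketch {

    // ANALYZED-like: lowercase the value and split it into separate terms.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // NOT_ANALYZED-like: the whole value is kept as one single term.
    static List<String> notAnalyzedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // ID field: stays one searchable unit.
        System.out.println(notAnalyzedTerms("finding nemo"));    // [finding nemo]
        // Comments field: every word becomes its own term.
        System.out.println(analyzedTerms("Loved finding Nemo!")); // [loved, finding, nemo]
    }
}
```

In this model a search for the single term "nemo" would hit the Comments value but not the ID value, which only matches the full string "finding nemo".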
Link(s) http://darksleep.com/lucene/
Hope this will help you... :)
Julia, it seems what you are looking for is n-grams, specifically bigrams (word pairs that co-occur unusually often are also called collocations).
Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.
In order to do this with Lucene, I suggest using Solr with ShingleFilterFactory. Please see this discussion for details.
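If you just want to see what bigram counting looks like before setting up Solr, here is a small plain-Java sketch. It does not use Lucene or Solr: BigramCounter, countBigrams and top are made-up names, and the tokenization (lowercase, split on non-letters) is a simplifying assumption. A shingle filter produces comparable word n-grams as index terms, so the counting would then be done over the index instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Naive bigram frequency counter; illustration only, no Lucene involved.
final class BigramCounter {

    // Counts adjacent word pairs after lowercasing and splitting on non-letters.
    // Note: sentence boundaries are ignored, so pairs can span two sentences.
    static Map<String, Integer> countBigrams(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            counts.merge(words.get(i) + " " + words.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to n bigrams, most frequent first (ties broken alphabetically).
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        String text = "Information retrieval is fun. Information retrieval with "
                + "Lucene is information retrieval too.";
        for (Map.Entry<String, Integer> e : top(countBigrams(text), 1)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
            // prints: information retrieval -> 3
        }
    }
}
```

Raw counts like these tend to be dominated by pairs of very common words on real text, which is why the collocation chapter mentioned above discusses statistical measures for ranking candidates instead of plain frequency.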