I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?
Same result, just a little cleaner, is to use the LuceneDictionary in the lucene-suggest package. It takes care of a field that does not contain any terms by returning a BytesRefIterator.EMPTY. That will save you an NPE :)

You're looking for term vectors (a set of all the words that were in the field and the number of times each word was used, excluding stop words). You'll use IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with them.

The alternative would be to use terms() and pick only terms for the field you're interested in:
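A minimal sketch of the terms() approach, assuming the Lucene 3.x API (IndexReader.terms(), TermEnum, Term). The class name UniqueTermsLucene3, the method name uniqueTerms and the fieldName parameter are illustrative only. The 3.x Javadoc for the no-argument terms() says next() must be called before term(), which this loop relies on; an enumeration that starts already positioned on a term would need its first term read before advancing.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Illustrative helper (hypothetical name), written against Lucene 3.x.
final class UniqueTermsLucene3 {

    /** Collects the distinct terms of one field by walking every term in the index. */
    static Set<String> uniqueTerms(IndexReader reader, String fieldName) throws IOException {
        Set<String> unique = new HashSet<String>();
        TermEnum terms = reader.terms(); // enumerates the terms of ALL fields
        try {
            while (terms.next()) {
                Term term = terms.term();
                // Keep only terms that belong to the requested field.
                if (term.field().equals(fieldName)) {
                    unique.add(term.text());
                }
            }
        } finally {
            terms.close();
        }
        return unique;
    }
}
```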
This is not the optimal solution: you're reading and then discarding all other fields. There's a class Fields in Lucene 4 that returns terms(field) only for a single field.

If you are using the Lucene 4.0 API, you need to get the fields out of the index reader. The Fields object then offers a way to get the terms for each field in the index. Here is an example of how to do that:
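A sketch of what this could look like, assuming Lucene 4.x on the classpath. The class name UniqueTermsLucene4, the method name uniqueTerms and the fieldName parameter are hypothetical. In 4.x, Terms.iterator takes a reuse argument (null here); later releases use a no-argument iterator() instead.

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Illustrative helper (hypothetical name), written against Lucene 4.x.
final class UniqueTermsLucene4 {

    /** Returns the distinct terms indexed for a single field, in sorted order. */
    static Set<String> uniqueTerms(IndexReader reader, String fieldName) throws IOException {
        Set<String> unique = new TreeSet<String>();

        Fields fields = MultiFields.getFields(reader);
        if (fields == null) {
            return unique; // the index has no postings at all
        }
        Terms terms = fields.terms(fieldName);
        if (terms == null) {
            return unique; // the field is missing or has no terms
        }

        TermsEnum it = terms.iterator(null); // 4.x signature
        BytesRef ref;
        // next() returns null once the terms are exhausted.
        while ((ref = it.next()) != null) {
            unique.add(ref.utf8ToString());
            // it.docFreq() is available here if the document frequency is needed.
        }
        return unique;
    }
}
```

The null checks cover the empty-index and empty-field cases, so the caller gets an empty set rather than an NPE.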
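In newer versions of Lucene you can get the string from the BytesRef by calling its utf8ToString() method instead of decoding the bytes yourself.
If you want the document frequency of a term, call docFreq() on the TermsEnum while it is positioned on that term.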
The answers using TermsEnum and terms.next() have a subtle off-by-one bug. This is because the TermsEnum already points to the first term, so while(terms.next()) will cause the first term to be skipped. Instead use a for loop:
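To make the reasoning concrete without depending on a particular Lucene version, here is a small self-contained sketch. StartsOnFirstCursor is a hypothetical stand-in, not a Lucene class; it assumes, as this answer does, a cursor that is already positioned on its first entry when it is handed out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a term cursor that starts ON its first entry. */
final class StartsOnFirstCursor {
    private final String[] terms;
    private int pos = 0; // already positioned on the first term

    StartsOnFirstCursor(String... terms) {
        this.terms = terms;
    }

    /** Current term, or null once the cursor has run past the end. */
    String term() {
        return pos < terms.length ? terms[pos] : null;
    }

    /** Advances the cursor; returns true if it now points at a term. */
    boolean next() {
        if (pos < terms.length) {
            pos++;
        }
        return pos < terms.length;
    }
}

final class OffByOneDemo {

    /** Buggy pattern: advancing before the first read drops the first term. */
    static List<String> whileLoop(StartsOnFirstCursor cursor) {
        List<String> out = new ArrayList<>();
        while (cursor.next()) {
            out.add(cursor.term());
        }
        return out;
    }

    /** Fixed pattern: read the current term first, then advance. */
    static List<String> forLoop(StartsOnFirstCursor cursor) {
        List<String> out = new ArrayList<>();
        for (String t = cursor.term(); t != null; cursor.next(), t = cursor.term()) {
            out.add(t);
        }
        return out;
    }
}
```

With a cursor over "apple", "banana", "cherry", whileLoop collects only banana and cherry, while forLoop collects all three. Whether a given Lucene enumeration starts on or before its first term depends on how it was obtained, so check the Javadoc of the method that returned it.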
To modify the code from the accepted answer: