How can I get the list of unique terms from a spec

Posted 2019-02-05 02:38

I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?

Tags: java lucene
4 answers
成全新的幸福
Reply #2 · 2019-02-05 02:46

The same result, just a little cleaner, comes from using LuceneDictionary in the lucene-suggest package. It handles a field that does not contain any terms by returning BytesRefIterator.EMPTY, which saves you an NPE :)

    LuceneDictionary ld = new LuceneDictionary( indexReader, "field" );
    BytesRefIterator iterator = ld.getWordsIterator();
    BytesRef byteRef = null;
    while ( ( byteRef = iterator.next() ) != null )
    {
        String term = byteRef.utf8ToString();
    }
地球回转人心会变
Reply #3 · 2019-02-05 02:54

You're looking for term vectors (a set of all the words that were in the field and the number of times each word was used, excluding stop words). You'll use IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with them.
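
The per-document approach boils down to a set union. Here is a minimal sketch in plain Java, with the index replaced by a hypothetical in-memory list of per-document term arrays (`TermVectorUnionSketch` and `docTerms` are illustrative names, not Lucene API). In Lucene 3.x each array would come from the term vector returned by getTermFreqVector(docid, field), which, as far as I know, can be null when no term vector was stored for that document:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for the per-document loop: NOT Lucene code.
final class TermVectorUnionSketch {

    // docTerms.get(i) models the terms of document i; a null entry models
    // a document that has no stored term vector for the field.
    static Set<String> uniqueTerms(List<String[]> docTerms) {
        Set<String> unique = new HashSet<>(); // duplicates collapse automatically
        for (String[] terms : docTerms) {
            if (terms == null) {
                continue; // nothing stored for this document
            }
            unique.addAll(Arrays.asList(terms));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"lucene", "index"},
                null,
                new String[] {"index", "term"});
        System.out.println(uniqueTerms(docs).size()); // 3
    }
}
```

Note that this visits every document, so on a large index it is likely to be slower than walking the term dictionary once.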

The alternative would be to use terms() and pick only terms for the field you're interested in:

IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
while (terms.next()) {
        final Term term = terms.term();
        if (term.field().equals("field_name")) {
                uniqueTerms.add(term.text());
        }
}

This is not the optimal solution: you read every term in the index and then discard those that belong to other fields. Lucene 4 has a Fields class whose terms(field) method returns the terms for a single field only.

疯言疯语
Reply #4 · 2019-02-05 02:54

If you are using the Lucene 4.0 API, you need to get the Fields out of the index reader. Fields then gives you access to the terms for each field in the index. Here is an example of how to do that:

        Fields fields = MultiFields.getFields(indexReader);
        Terms terms = fields.terms("field");
        // terms is null when the field does not exist or has no terms
        if (terms != null) {
            TermsEnum iterator = terms.iterator(null);
            BytesRef byteRef = null;
            while ((byteRef = iterator.next()) != null) {
                String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
                // use term here, e.g. add it to a Set<String>
            }
        }

In newer versions of Lucene, you can get the string from the BytesRef by calling:

       byteRef.utf8ToString();

instead of

       new String(byteRef.bytes, byteRef.offset, byteRef.length);

If you want to get the document frequency of the current term, you can call:

       int docFreq = iterator.docFreq();
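
As a reminder of what that number means: document frequency counts the documents that contain a term, not the total number of occurrences. A small plain-Java sketch of the distinction follows; the `DocFreqSketch` class and its in-memory documents are hypothetical stand-ins, not Lucene API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in illustration of document frequency: NOT Lucene code.
final class DocFreqSketch {

    // Each inner list models the tokens of one document's field.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (List<String> tokens : docs) {
            // Deduplicate within the document so repeats count only once.
            for (String term : new HashSet<>(tokens)) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "lucene", "index"),
                Arrays.asList("index", "term"));
        // "lucene" occurs twice, but in one document only.
        System.out.println(docFreqs(docs)); // {index=2, lucene=1, term=1}
    }
}
```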
在下西门庆
Reply #5 · 2019-02-05 03:00

The answers using TermEnum and terms.next() have a subtle off-by-one bug: the TermEnum already points to the first term, so while (terms.next()) causes the first term to be skipped.

Instead use a for loop:

TermEnum terms = reader.terms();
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    // do something with the term
}

To modify the code from the accepted answer:

IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
        if (term.field().equals("field_name")) {
                uniqueTerms.add(term.text());
        }
}
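
Whether the first term is skipped depends on where the enumeration is positioned when you receive it, so it is worth checking the Javadoc of the Lucene version in use. If I read the 3.x Javadoc correctly, the no-argument IndexReader.terms() requires a call to next() before term(), while terms(Term) returns an enumeration already positioned on a term. The sketch below uses a hypothetical `Cursor` class (a stand-in, not Lucene's TermEnum) to show both conventions side by side:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an enumeration with next()/current(); NOT Lucene API.
final class Cursor {
    private final List<String> items;
    private int pos;

    // startOnFirst = true: already positioned on the first element.
    // startOnFirst = false: positioned before it; next() must be called first.
    Cursor(List<String> items, boolean startOnFirst) {
        this.items = items;
        this.pos = startOnFirst ? 0 : -1;
    }

    // Advances by one element; returns false once the end is passed.
    boolean next() {
        if (pos < items.size()) {
            pos++;
        }
        return pos < items.size();
    }

    // Returns the current element, or null when not positioned on one.
    String current() {
        return (pos >= 0 && pos < items.size()) ? items.get(pos) : null;
    }

    // Pattern from the earlier answers: advance, then read.
    static List<String> whileLoop(Cursor c) {
        List<String> seen = new ArrayList<>();
        while (c.next()) {
            seen.add(c.current());
        }
        return seen;
    }

    // Pattern from this answer: read, then advance.
    static List<String> forLoop(Cursor c) {
        List<String> seen = new ArrayList<>();
        for (String s = c.current(); s != null; c.next(), s = c.current()) {
            seen.add(s);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c");
        System.out.println(whileLoop(new Cursor(terms, true)));  // [b, c]: first element skipped
        System.out.println(forLoop(new Cursor(terms, true)));    // [a, b, c]
        System.out.println(whileLoop(new Cursor(terms, false))); // [a, b, c]
        System.out.println(forLoop(new Cursor(terms, false)));   // []: current() is null before next()
    }
}
```

Each loop shape is only correct for one starting position, so a quick test on a tiny index with known terms is a cheap way to confirm which one applies.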