How can I get the list of unique terms from a spec

I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?

Tags: java lucene

4 answers

成全新的幸福

#2 · 2019-02-05 02:46

The same result, just a little cleaner, can be had with the LuceneDictionary in the lucene-suggest package. It takes care of a field that does not contain any terms by returning BytesRefIterator.EMPTY. That will save you an NPE :)

    LuceneDictionary ld = new LuceneDictionary( indexReader, "field" );
    BytesRefIterator iterator = ld.getWordsIterator();
    BytesRef byteRef = null;
    while ( ( byteRef = iterator.next() ) != null )
    {
        String term = byteRef.utf8ToString();
    }


地球回转人心会变

#3 · 2019-02-05 02:54

You're looking for term vectors (the set of all the words that were in the field and the number of times each word was used, excluding stop words). Call IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with the returned terms. This only works if term vectors were stored for the field at index time; otherwise getTermFreqVector returns null.
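The union step of the term-vector approach can be sketched without Lucene on the classpath. This is only an illustration in plain Java: `TermVectorUnion`, `uniqueTerms`, and the sample data are hypothetical, each `String[]` stands in for the terms of one document's term vector, and a `null` entry models a document for which no term vector is available.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermVectorUnion {
    /**
     * Unions per-document term lists into one set of unique terms.
     * A null entry models a document without a term vector for the field.
     */
    static Set<String> uniqueTerms(List<String[]> perDocumentTerms) {
        Set<String> unique = new HashSet<>();
        for (String[] terms : perDocumentTerms) {
            if (terms == null) {
                continue; // skip documents without a term vector
            }
            unique.addAll(Arrays.asList(terms));
        }
        return unique;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: three documents, the second has no vector
        List<String[]> docs = Arrays.asList(
                new String[] {"lucene", "index"},
                null,
                new String[] {"index", "search"});
        System.out.println(uniqueTerms(docs).size()); // three distinct terms
    }
}
```

Note that this visits every document, so the work grows with the size of the corpus rather than with the number of distinct terms.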

The alternative would be to use terms() and pick only terms for the field you're interested in:

IndexReader reader = IndexReader.open(index);
// The no-argument terms() starts before the first term, so next() must be called first
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
while (terms.next()) {
    final Term term = terms.term();
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
terms.close();

This is not the optimal solution, because you read and then discard the terms of all other fields. Lucene 4 has a Fields class whose terms(field) returns the terms of a single field only.
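The cost of scanning every field can be avoided when terms are ordered by field and then by text: seek to the first term of the field and stop as soon as the field changes. The following is a sketch in plain Java, not Lucene code; `FieldTermScan`, `FieldTerm`, and `termsOf` are hypothetical names, and a `TreeSet` stands in for the sorted term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class FieldTermScan {
    /** Hypothetical stand-in for a term: ordered by field, then by text. */
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Seeks to the first term of the field and stops at the first term of another field. */
    static List<String> termsOf(TreeSet<FieldTerm> dictionary, String field) {
        List<String> result = new ArrayList<>();
        // The empty text sorts before every real term of the same field
        for (FieldTerm t : dictionary.tailSet(new FieldTerm(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // past the last term of the requested field
            }
            result.add(t.text);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<FieldTerm> dictionary = new TreeSet<>();
        dictionary.add(new FieldTerm("body", "lucene"));
        dictionary.add(new FieldTerm("body", "index"));
        dictionary.add(new FieldTerm("id", "42"));
        dictionary.add(new FieldTerm("title", "search"));
        System.out.println(termsOf(dictionary, "body")); // prints [index, lucene]
    }
}
```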


疯言疯语

#4 · 2019-02-05 02:54

If you are using the Lucene 4.0 API, you need to get the fields out of the index reader. The Fields object then gives you the terms for each field in the index. Here is an example of how to do that:

        Fields fields = MultiFields.getFields(indexReader);
        Terms terms = fields.terms("field");   // null if the field has no terms
        if (terms != null) {
            TermsEnum iterator = terms.iterator(null);
            BytesRef byteRef = null;
            while ((byteRef = iterator.next()) != null) {
                String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
            }
        }

In newer versions of Lucene, you can get the string from the BytesRef by calling:

       byteRef.utf8ToString();

instead of

       new String(byteRef.bytes, byteRef.offset, byteRef.length);
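One reason to prefer an explicit UTF-8 conversion: the three-argument `new String(bytes, offset, length)` constructor decodes with the platform default charset, which may not be UTF-8 on every machine. A small stand-alone sketch of decoding a slice of a shared byte buffer follows; `BytesSliceDecoding` and `decodeUtf8` are hypothetical names and no Lucene classes are involved.

```java
import java.nio.charset.StandardCharsets;

final class BytesSliceDecoding {
    /** Hypothetical helper: decodes bytes[offset, offset + length) as UTF-8, independent of the platform charset. */
    static String decodeUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One shared buffer holding two terms back to back
        byte[] buffer = "caf\u00e9luc".getBytes(StandardCharsets.UTF_8);
        // "caf\u00e9" takes 5 bytes in UTF-8 because the accented letter needs two
        System.out.println(decodeUtf8(buffer, 0, 5));
        System.out.println(decodeUtf8(buffer, 5, 3)); // prints luc
    }
}
```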

If you want the document frequency of the current term, call:

       int docFreq = iterator.docFreq();


在下西门庆

#5 · 2019-02-05 03:00

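Code that combines TermEnum with while(terms.next()) can have a subtle off-by-one bug. It occurs when the TermEnum already points to the first term, which is the case for an enumeration obtained from reader.terms(Term): while(terms.next()) then skips that first term. (According to the Lucene 3.x Javadoc, the no-argument reader.terms() starts before the first term, so there next() has to be called before term().)

For an enumeration that is already positioned, use a for loop instead:

TermEnum terms = reader.terms(new Term("field_name", ""));
for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    // do something with the term
}

To modify the code from the accepted answer:

IndexReader reader = IndexReader.open(index);
// Seek straight to the first term of the field; the enumeration is positioned on it
TermEnum terms = reader.terms(new Term("field_name", ""));
Set<String> uniqueTerms = new HashSet<String>();
for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    if (!term.field().equals("field_name")) {
        break; // terms are ordered by field, so the field is exhausted
    }
    uniqueTerms.add(term.text());
}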
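Which loop shape is correct depends on whether the enumeration starts on the first term or before it. The sketch below makes both cases visible in plain Java; `CursorLoops` and `Cursor` are hypothetical stand-ins, not Lucene classes, and the flag `positionedOnFirst` only models the two possible starting positions.

```java
import java.util.ArrayList;
import java.util.List;

final class CursorLoops {
    /** Hypothetical stand-in for a term enumeration; not a Lucene class. */
    static final class Cursor {
        private final String[] terms;
        private int pos;

        Cursor(String[] terms, boolean positionedOnFirst) {
            this.terms = terms;
            this.pos = positionedOnFirst ? 0 : -1;
        }

        /** Advances by one term; returns false once the end is reached. */
        boolean next() {
            if (pos < terms.length) {
                pos++;
            }
            return pos < terms.length;
        }

        /** Current term, or null when before the first or past the last term. */
        String term() {
            return (pos >= 0 && pos < terms.length) ? terms[pos] : null;
        }
    }

    static List<String> whileLoop(Cursor c) {
        List<String> out = new ArrayList<>();
        while (c.next()) {
            out.add(c.term());
        }
        return out;
    }

    static List<String> forLoop(Cursor c) {
        List<String> out = new ArrayList<>();
        for (String t = c.term(); t != null; c.next(), t = c.term()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c"};
        System.out.println(whileLoop(new Cursor(terms, true)));  // prints [b, c]
        System.out.println(forLoop(new Cursor(terms, true)));    // prints [a, b, c]
        System.out.println(whileLoop(new Cursor(terms, false))); // prints [a, b, c]
        System.out.println(forLoop(new Cursor(terms, false)));   // prints []
    }
}
```

Each loop is correct for exactly one starting position, so it is worth checking the documentation of the method that produced the enumeration before choosing.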
