I've seen this question, or ones very like it, many times on Stack Overflow and elsewhere online. However, the relevant part of Lucene's API seems to have changed considerably, so in short: I could not find any example that works with the latest Lucene version.
What I have:
- Lucene Index + IndexReader + IndexSearcher
- a bunch of documents (and their IDs)
What I want: For every term that occurs in at least one of the selected documents, I want its TF-IDF value for each of those documents, e.g., as an array with one TF-IDF value per selected document.
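To make the desired output concrete, here is a minimal plain-Java sketch with no Lucene calls. The class `TfIdfSketch`, the method `tfidfPerDocument`, and the input maps are all hypothetical names made up for illustration. The formulas tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)) are what I understand DefaultSimilarity to use in Lucene 4.x; treat that as an assumption. The point is only the shape of the result: the term frequency comes from each selected document, while the document frequency and document count come from the whole index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names are Lucene API.
final class TfIdfSketch {

    // Assumed classic formula: tf = sqrt(freq).
    static double tf(long freq) {
        return Math.sqrt(freq);
    }

    // Assumed classic formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long numDocs, long docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * selectedDocs: one map per selected document, term -> frequency in that document.
     * indexDocFreq: term -> number of documents in the whole index containing the term.
     * indexNumDocs: number of documents in the whole index.
     * Returns term -> array with one TF-IDF value per selected document
     * (0.0 where the term does not occur in that document).
     */
    static Map<String, double[]> tfidfPerDocument(
            List<Map<String, Integer>> selectedDocs,
            Map<String, Integer> indexDocFreq,
            int indexNumDocs) {
        Map<String, double[]> result = new TreeMap<>();
        for (int i = 0; i < selectedDocs.size(); i++) {
            for (Map.Entry<String, Integer> e : selectedDocs.get(i).entrySet()) {
                String term = e.getKey();
                // Create the per-term row lazily, sized to the number of selected documents.
                double[] row = result.computeIfAbsent(
                        term, k -> new double[selectedDocs.size()]);
                int df = indexDocFreq.getOrDefault(term, 0);
                row[i] = tf(e.getValue()) * idf(indexNumDocs, df);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc0 = new HashMap<>();
        doc0.put("lucene", 4);
        doc0.put("index", 1);
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("lucene", 1);

        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 9);
        docFreq.put("index", 4);

        Map<String, double[]> rows =
                tfidfPerDocument(List.of(doc0, doc1), docFreq, 10);
        for (Map.Entry<String, double[]> e : rows.entrySet()) {
            System.out.println(e.getKey() + " -> " + java.util.Arrays.toString(e.getValue()));
        }
    }
}
```

In the real setup the per-document frequencies and the index-wide document frequencies would have to come from the IndexReader rather than from hand-built maps; how to get them with the current API is exactly what I am asking.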
Any help is highly appreciated! :-)
Here's what I've come up with so far, but there are two problems:
- It uses a temporary RAMDirectory that contains only the selected documents. Is there a way to work on the original index instead, or does that not make sense?
- It does not compute a per-document TF-IDF, only an index-wide one, i.e., over all documents. That means I get a single TF-IDF value per term rather than one value per term and document.
    public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {
        Bits liveDocs = MultiFields.getLiveDocs(reader);
        TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
        BytesRef term = null;
        TFIDFSimilarity tfidfSim = new DefaultSimilarity();
        int docCount = reader.numDocs();
        while ((term = termEnum.next()) != null) {
            String termText = term.utf8ToString();
            Term termInstance = new Term(field, term);
            // term and doc frequency in all documents
            long indexTf = reader.totalTermFreq(termInstance);
            long indexDf = reader.docFreq(termInstance);
            double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(docCount, indexDf);
            // store it, but that's not the problem