I am using Lucene 3.5.0, and I want to output the term vectors of each document. For example, I want to know the frequency of a term across all documents and within each specific document. My indexing code is:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexer {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0];
        String dataDir = args[1];
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_35),
            true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }
    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                (filter == null || filter.accept(f))) {
                // read the first line (the URL) from the file itself,
                // not from a path relative to the working directory
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }
    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}
Can anybody help me write a program to do that? Thanks.
I am on Lucene core 3.0.3, but I expect the API will be very similar. The method below totals up a term frequency map for a given set of document numbers and a list of fields of interest, ignoring stop words.
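Something along these lines should work; this is a sketch, not a drop-in solution. It assumes the fields were indexed with term vectors enabled (Field.TermVector.YES), and the class name, method signature, and the choice of StandardAnalyzer's default stop set are my own, so adapt them to your setup:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermFrequencyCounter {
    // Totals term frequencies over the given document numbers and fields,
    // skipping stop words. Requires the fields to have been indexed with
    // Field.TermVector.YES; otherwise getTermFreqVector() returns null.
    public static Map<String, Integer> getTermFrequencyMap(IndexReader reader,
            Set<Integer> docNums, Set<String> fields) throws IOException {
        Map<String, Integer> totals = new HashMap<String, Integer>(1024);
        for (Integer docNum : docNums) {
            for (String field : fields) {
                TermFreqVector tfv = reader.getTermFreqVector(docNum, field);
                if (tfv == null) {
                    continue; // no term vector stored for this field
                }
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int t = 0; t < terms.length; t++) {
                    if (StandardAnalyzer.STOP_WORDS_SET.contains(terms[t])) {
                        continue; // skip stop words
                    }
                    Integer total = totals.get(terms[t]);
                    totals.put(terms[t], total == null ? freqs[t] : total + freqs[t]);
                }
            }
        }
        return totals;
    }
}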
First of all, you don't need to store term vectors just to know the frequency of a term in documents. Lucene stores these numbers anyway, for use in TF-IDF calculation. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
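A minimal sketch of that iteration, assuming the "contents" field from your indexer; the term "lucene" is just an example:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermFreqPrinter {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        Term term = new Term("contents", "lucene"); // example field and term
        // number of documents containing the term
        System.out.println("docFreq: " + reader.docFreq(term));
        TermDocs termDocs = reader.termDocs(term);
        while (termDocs.next()) {
            // doc() is the document number, freq() the term's frequency in that document
            System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq() + " occurrence(s)");
        }
        termDocs.close();
        reader.close();
    }
}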
If you have some other purpose in mind and you actually need to access the term vectors, then you need to tell Lucene to store them by passing Field.TermVector.YES as the last argument of the Field constructor. Then you can retrieve the vectors, e.g. with IndexReader.getTermFreqVector().
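For example, reusing the "contents" field from your indexer (docNum stands for whichever document number you want to inspect):

// In getDocument(): store the term vector for the contents field
// using the Field(String, Reader, Field.TermVector) overload.
doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));

// Later, at search/analysis time, retrieve the vector for one document:
TermFreqVector tfv = reader.getTermFreqVector(docNum, "contents");
if (tfv != null) {
    String[] terms = tfv.getTerms();        // distinct terms in this document's field
    int[] freqs = tfv.getTermFrequencies(); // freqs[i] is the frequency of terms[i]
}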