I have a folder (MY_FILES) that holds around 500 files, and each day a new file arrives and is placed there. Each file is around 4 MB.
I've just developed a simple 'void main' to test whether I can run a wildcard search over those files. It works just fine.
The problem is that I delete the old indexed_folder and reindex everything each time. This takes a lot of time and is obviously inefficient. What I'm looking for is 'incremental indexing': if the index already exists, just add the new files to it.
I was wondering if Lucene has some kind of mechanism to check if the 'doc' was indexed before trying to index it. Something like writer.isDocExists?
My code looks like this:
// build the writer
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
IndexWriter writer = new IndexWriter(fsDir, config);
writer.deleteAll(); // required: otherwise searches return duplicated results

// build the docs and add them to the writer
File dir = new File(MY_FILES);
File[] files = dir.listFiles();
int counter = 0;
for (File file : files) {
    String path = file.getCanonicalPath();
    FileReader reader = new FileReader(file);
    Document doc = new Document();
    doc.add(new Field("filename", file.getName(), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("content", reader));
    writer.addDocument(doc); // the reader is consumed while the document is added
    System.out.println("indexing " + file.getName() + " " + ++counter + "/" + files.length);
}
writer.close();
First, you should use IndexWriter.updateDocument(Term, Document) instead of IndexWriter.addDocument to update documents; this will prevent your index from containing duplicated entries.
To perform incremental indexing, you should add a time stamp to the documents of your index and only index documents that are newer.
EDIT: more details on incremental indexing
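Your documents should have at least two fields: one that uniquely identifies the file (for example, its path) and one that holds the file's time stamp.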
Before starting indexing, just search your index for the latest time stamp and then crawl your directory to find all files whose time stamp is newer than the newest time stamp of the index.
This way, your index will be updated every time a file changes.
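The crawl step described above can be sketched with the standard library alone. This is only an illustration under assumptions: NewFileFinder and filesNewerThan are hypothetical names, and latestIndexedMillis stands for the newest time stamp you found in the index, however you retrieve it.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks the files that changed since the last indexing run. */
final class NewFileFinder {

    private NewFileFinder() {}

    /**
     * Returns the regular files in dir whose last-modified time is strictly
     * newer than latestIndexedMillis (the newest time stamp found in the index).
     */
    static List<File> filesNewerThan(File dir, long latestIndexedMillis) {
        List<File> result = new ArrayList<File>();
        File[] files = dir.listFiles();
        if (files == null) {
            return result; // dir does not exist or is not a directory
        }
        for (File file : files) {
            if (file.isFile() && file.lastModified() > latestIndexedMillis) {
                // In the indexing loop, each of these files would be passed to
                // writer.updateDocument(new Term("path", path), doc) rather than addDocument.
                result.add(file);
            }
        }
        return result;
    }
}
```

Only the returned files need to go through the indexing loop. Note that for updateDocument to replace the old entry, the field used as the key (path in this sketch) would have to be indexed without analysis, e.g. Field.Index.NOT_ANALYZED, so that the Term matches the stored value exactly.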
If you want to check whether your document is already present in the index, one method is to generate the corresponding Lucene query and run it with an IndexSearcher against the index.
For instance, here you can build a query using the fields of your documents (such as filename and path) to check whether the document is already present.
You will need an IndexSearcher besides your IndexWriter, and the query text you provide to Lucene must follow the Lucene query syntax.
The Collector you pass to the search has a callback method, collect, which is called with a document id whenever some data in the index matches the query.
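As a rough illustration of the query-text step only, here is one way the string could be assembled. The field names filename and path come from the question's code; ExistsQueryBuilder, quote and build are hypothetical names, and quoting each value as a phrase is an assumption that may not suit every analyzer.

```java
/** Hypothetical helper: builds a Lucene query string that targets one file. */
final class ExistsQueryBuilder {

    private ExistsQueryBuilder() {}

    /** Escapes backslashes and double quotes, then wraps the value in double quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Combines both fields with AND so that only a document matching both is found. */
    static String build(String filename, String path) {
        return "filename:" + quote(filename) + " AND path:" + quote(path);
    }
}
```

The resulting string would then be parsed with a QueryParser and executed through the IndexSearcher; if the Collector's collect method is called at least once, a matching document already exists.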