I am creating a Lucene index for values fetched from a database. I have set the index OpenMode to OpenMode.CREATE_OR_APPEND.
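For reference, the writer itself is opened roughly as in the sketch below; the index path and analyzer are placeholders, not my exact setup:

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch of how the IndexWriter is opened (path and analyzer are placeholders).
private IndexWriter openWriter() throws IOException {
    FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));          // placeholder index location
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()); // placeholder analyzer
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);          // append if index exists, else create
    return new IndexWriter(dir, config);
}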
The index creation step is part of a Spring Batch job.
My understanding is that when I run the job for the first time, indexing might take a while, but when I rerun the job for the same unchanged source data, it should be fast because the documents are already there and no update or insert needs to be performed.
In my case, however, subsequent indexing attempts for the same unchanged source data get slower and slower.
The answer to this question says that it will be handled automatically based on a term, but I am not sure how to define that term in my case.
Below is my sample code:
public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
    Integer count = 0;
    txtFieldType.setTokenized(false);
    strFieldType.setTokenized(false);

    // Fetch the rows to be indexed for this input
    List<IndexVO> indexVOs = jdbcTemplate.query(Constants.SELECT_FROM_TABLE1,
            new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str},
            new IndexRowMapper());

    // Build and add one document per row
    for (IndexVO indexVO : indexVOs) {
        Document d = new Document();
        d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
        .....
        ....
        writer.addDocument(d);
        count++;
    }
    return count;
}
What should I change in the above code so that indexing is not performed when there is no change in the source data?
I am a beginner with Lucene and not sure how to define the Term that would decide about duplicates.
I don't want the index to be recreated, and I want a new Document to be skipped (i.e. do nothing) if exactly the same Document already exists in the index.
EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach with a focus on duplicate avoidance, given that each document represents a row of an RDBMS table with a primary key: if a DB row has changed, update its document, otherwise leave it alone, and add documents for new rows.
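In code, that upsert-by-primary-key idea is roughly the sketch below; the "id" field name and the row accessor are illustrative assumptions, not my actual code:

// Uses org.apache.lucene.document.Document/StringField/Field and org.apache.lucene.index.Term
Document doc = new Document();
doc.add(new StringField("id", String.valueOf(row.getId()), Field.Store.YES)); // primary key, kept untokenized
// ... add the other value fields from the row ...
// updateDocument deletes any document(s) matching the term and then adds this one,
// so a changed row replaces its old document and a brand-new row simply gets added.
writer.updateDocument(new Term("id", String.valueOf(row.getId())), doc);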
I have verified that in Lucene 6.0.0, IndexWriter.updateDocument(Term term, Document doc) adds a new Document if the document doesn't already exist and updates the existing Document if one is found per the term. For my requirement, I defined a key field which is basically a concatenation of all the other value fields of the Document. This way the key identifies content-wise duplicates, i.e. two documents having the same key means the documents are content-wise duplicates. I construct the Term to be passed to IndexWriter.updateDocument(Term term, Document doc) from this key value, and simply calling IndexWriter.updateDocument(Term term, Document doc) instead of IndexWriter.addDocument(Document doc) solves the issue.
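For anyone else reading, this is roughly what that ended up looking like; the sketch below simplifies my actual fields, and field1Value / field2Value are placeholders:

// Build the "key" field as a concatenation of the other value fields, then upsert on it.
String key = String.valueOf(luceneInputVO.getId()) + "|" + field1Value + "|" + field2Value; // placeholder fields

Document d = new Document();
d.add(new StringField("key", key, Field.Store.YES)); // untokenized content key
d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
// ... other value fields ...

// If a document with the same key already exists it is replaced; otherwise a new document is added,
// so rerunning the job over unchanged source data no longer piles up duplicates.
writer.updateDocument(new Term("key", key), d);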