How to avoid duplicate document indexing in Lucene

Posted 2019-07-23 04:58

I am creating a Lucene index for values fetched from a database. I have set the index OpenMode to OpenMode.CREATE_OR_APPEND.

The index creation step is part of a Spring Batch job.

My understanding is that when I run the job for the first time, indexing might take a while, but when I rerun the job for the same unchanged source data, it should be fast because the documents are already there, so no update or insert needs to be performed.
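For reference, here is roughly how the writer is opened (a minimal sketch; the index path and analyzer are placeholders, not my actual configuration):

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Minimal sketch of how the IndexWriter is opened; the path and analyzer
    // below are placeholders, not my real setup.
    IndexWriter openWriter() throws IOException {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index")); // placeholder path
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // CREATE_OR_APPEND reuses an existing index instead of rebuilding it from scratch.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        return new IndexWriter(dir, config);
    }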

But in my case, subsequent indexing runs over the same unchanged source data get slower and slower.

The answer to this question says that it will be handled automatically based on a term.

I am not sure how to define the term in my case to handle this.

Below is my sample code:

    public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
        Integer count = 0;
        txtFieldType.setTokenized(false);
        strFieldType.setTokenized(false);

        List<IndexVO> indexVO = jdbcTemplate.query(
                Constants.SELECT_FROM_TABLE1,
                new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str},
                new IndexRowMapper());

        // Build and add one Lucene Document per row returned from the database.
        for (IndexVO row : indexVO) {
            Document d = new Document();
            d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
            // ..... other fields added here .....
            writer.addDocument(d);
            count++;
        }
        return count;
    }

What should I change in the above code so that indexing is skipped when there is no change in the source data?

I am a beginner with Lucene and am not sure how to define the Term that would decide about duplicates.

I don't want the index to be recreated, and I want a new Document to be skipped (do nothing) if exactly the same Document already exists in the index.

EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach with a focus on duplicate avoidance, given that a document represents a row of an RDBMS table with a primary key. If the DB row has changed, update the document, otherwise leave it alone, and add documents for new rows.

Question 1, Question 2

1 Answer
手持菜刀,她持情操 · Answered 2019-07-23 05:55

I have verified that in Lucene 6.0.0, IndexWriter.updateDocument(Term term, Document doc) adds a new Document if the document doesn't already exist, and updates the existing Document if one is found matching the term.

For my requirement, I defined a key field, which is basically a concatenation of all the other value fields of the Document. This way the key identifies content-wise duplicates, i.e. two documents having the same key means the documents are duplicates content-wise.

I construct the Term passed to IndexWriter.updateDocument(Term term, Document doc) from this key value, and simply calling updateDocument instead of IndexWriter.addDocument(Document doc) solves the issue.
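A minimal sketch of the idea (the field names and the way the key is built are illustrative, not the exact code):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Sketch: index one row without creating duplicates, keyed on a content key.
    // Field names and the key construction below are illustrative placeholders.
    void indexRow(IndexWriter writer, String id, String field1, String field2) throws Exception {
        // Key is a concatenation of the value fields: identical content => identical key.
        String key = id + "|" + field1 + "|" + field2;

        Document d = new Document();
        // StringField is indexed untokenized, so the Term below matches the key exactly.
        d.add(new StringField("key", key, Field.Store.YES));
        d.add(new StringField("id", id, Field.Store.YES));
        // ... remaining value fields ...

        // Replaces the existing document that matches the key, or adds a new one;
        // re-running over unchanged data therefore does not grow the index.
        writer.updateDocument(new Term("key", key), d);
    }

If a changed row should replace its earlier version (as in the question's EDIT), the Term can be built from the primary-key field alone instead of the full content key.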
