Approach to Incrementally Index Database Data from a SQL Join

Posted 2019-09-16 03:45

Question:

I have a particular SQL join such that:

select DISTINCT ... 100 columns
from ... 10 tables, some left joins

Currently I export the result of this query to XML using Toad (I'll query it straight from Java later). I use Java to parse the XML file, and I use Lucene (Java) to index it and to search the Lucene index. This works great: I get results 6-10 times faster than querying it from the database.

I need to think of a way to incrementally update this index when the data in the database changes.

Because I am joining tables (especially with left joins), I'm not sure I can get a unique business-key combination to drive an incremental update. On the other hand, because I am using DISTINCT, I know that every row is a unique combination of fields. Given this, I thought I could store the hashCode of a document as a field of that document, and call updateDocument on the IndexWriter like this:

public static void addDoc(IndexWriter w, Row row) throws IOException {
    //Row is simply a java representation of a single row from the above query
    Document document = new Document();
    document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
    ...
    String hashCode = String.valueOf(document.hashCode());
    document.add(new StringField("HASH", hashCode, Field.Store.YES));
    w.updateDocument(new Term("HASH", hashCode), document);
}

Then I realized that updateDocument was simply deleting the document with the matching hash code and adding an identical document again, so this wasn't of any use.

What is the way to approach this?

Answer 1:

Lucene has no concept of "updating" a document in place, so an update is essentially a delete + add.

You can track the progress on that here: https://issues.apache.org/jira/browse/LUCENE-4258

So you will need to keep the doc.hashCode() logic in your application, i.e. do not ask Lucene to index a document if you know that none of its values have changed. (You can keep a set of hash-code values; if a document's hash is already in the set, that document has not changed.) You will also want some logic for tracking deletes.
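A minimal sketch of that bookkeeping, using only java.util. The class and method names here are illustrative, not from any real API, and the Lucene IndexWriter calls themselves are only indicated in comments:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Keeps the set of hash codes already stored in the Lucene index and
// computes, for a fresh query result, which rows need indexing and which
// stale hashes should be deleted from the index.
class IndexDiff {

    // Hashes currently stored in the index (e.g. loaded from the "HASH" field).
    private final Set<String> indexedHashes;

    IndexDiff(Set<String> indexedHashes) {
        this.indexedHashes = new HashSet<>(indexedHashes);
    }

    // Hashes from the fresh query result that are NOT yet indexed:
    // only rows with these hashes need to go through addDoc().
    Set<String> hashesToAdd(Set<String> freshHashes) {
        Set<String> toAdd = new HashSet<>(freshHashes);
        toAdd.removeAll(indexedHashes);
        return toAdd;
    }

    // Indexed hashes that no longer appear in the fresh result; for each of
    // these you would call writer.deleteDocuments(new Term("HASH", h)).
    Set<String> hashesToDelete(Set<String> freshHashes) {
        Set<String> toDelete = new HashSet<>(indexedHashes);
        toDelete.removeAll(freshHashes);
        return toDelete;
    }
}
```

With the index containing hashes {a, b} and the fresh export yielding {b, c}, only the row hashing to c is indexed and the document with hash a is deleted; b is untouched.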



Answer 2:

If you increment an id on each relevant update of your source DB tables, and if you log these ids on record deletion, you should then be able to list the deleted, updated and new records among the data being indexed.

This step might be performed within a transitory table, itself exported into the XML file used as input to Lucene.
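The scheme above can be simulated in plain Java: every upsert stamps the row with a monotonically increasing version id, deletions are logged, and each sync asks for rows touched since the last indexed version. All names here are illustrative; in practice the version column and deletion log would live in the database:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simulates a change-tracked source table: each upsert bumps a global
// version counter, and deleted keys are logged so the indexer can remove
// the corresponding Lucene documents on the next sync.
class ChangeLog {
    private long nextVersion = 1;
    private final Map<String, Long> liveRows = new HashMap<>(); // key -> version
    private final List<String> deletedKeys = new ArrayList<>();

    // Insert or update a row, stamping it with a fresh version id.
    void upsert(String key) {
        liveRows.put(key, nextVersion++);
    }

    // Delete a row and remember its key for the indexer.
    void delete(String key) {
        if (liveRows.remove(key) != null) {
            deletedKeys.add(key);
        }
    }

    // Keys of rows touched after the last indexed version: these are the
    // new/updated records to re-export and re-index.
    Set<String> changedSince(long lastIndexedVersion) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, Long> e : liveRows.entrySet()) {
            if (e.getValue() > lastIndexedVersion) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    // Keys whose documents should be deleted from the index.
    List<String> deletions() {
        return deletedKeys;
    }
}
```

If rows a and b were indexed at versions 1 and 2, then c is inserted and a deleted, a sync with lastIndexedVersion = 2 re-indexes only c and removes a.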



Tags: sql lucene