Index a MySQL database with Apache Lucene, and keep them synchronized

Published 2020-05-20 02:52

Question:

  1. When a new item is added in MySQL, it must be also indexed by Lucene.
  2. When an existing item is removed from MySQL, it must be also removed from Lucene's index.

The idea is to write a script that will be called every x minutes via a scheduler (e.g. a CRON task). This is a way to keep MySQL and Lucene synchronized. What I have managed so far:

  1. For each new added item in MySQL, Lucene indexes it too.
  2. For each already added item in MySQL, Lucene does not reindex it (no duplicated items).

This is the point where I need some help:

  1. For each previously added item that has been then removed from MySQL, Lucene should also unindex it.

Here is the code I used, which indexes a MySQL table tag (id [PK] | name):

public static void main(String[] args) throws Exception {

    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        Document document = new Document();
        // The id is stored but not analyzed, so it can serve as an exact-match key.
        document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        // updateDocument first deletes any document matching the id term,
        // then adds the new one -- this is what prevents duplicates.
        writer.updateDocument(new Term("id", result.getString("id")), document);
    }

    result.close();
    statement.close();
    connection.close();
    writer.close();

}

PS: this code is for test purposes only, no need to tell me how awful it is :)

EDIT:

One solution could be to delete every previously added document and reindex the whole database:

writer.deleteAll();
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}

I'm not sure this is the most efficient solution, though. Is it?
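A cheaper alternative to deleteAll() would be to diff the two id sets: collect the ids currently stored in the index, compare them against the ids still in MySQL, and delete only the stale ones with writer.deleteDocuments(new Term("id", staleId)). Below is a minimal sketch of just the set-difference step, using plain collections in place of the actual JDBC and Lucene calls (the class name and sample ids are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class StaleIdDiff {

    // Returns the ids present in the index but no longer present in MySQL.
    // In real code, indexedIds would come from iterating the "id" terms of
    // the Lucene index, and dbIds from "SELECT id FROM tag".
    static Set<String> staleIds(Set<String> indexedIds, Set<String> dbIds) {
        Set<String> stale = new LinkedHashSet<String>(indexedIds);
        stale.removeAll(dbIds);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> indexedIds = new HashSet<String>(Arrays.asList("1", "2", "3"));
        Set<String> dbIds = new HashSet<String>(Arrays.asList("1", "3"));

        // Only id "2" was removed from MySQL, so only that document would be
        // passed to writer.deleteDocuments(new Term("id", id)).
        System.out.println(staleIds(indexedIds, dbIds)); // prints [2]
    }
}
```

This way, the deletion cost is proportional to the number of removed rows rather than to the size of the whole index.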

Answer 1:

As long as you let the indexing/reindexing run separately from your application, you will have synchronization problems. Depending on your field of work, this might not be a problem, but for many concurrent-user-applications it is.

We had the same problem when we had a job system running asynchronous indexing every few minutes. Users would find a product using the search engine; then, even after an administrator removed the product from the valid product stack, they would still find it in the frontend until the next reindexing job ran. This leads to very confusing and rarely reproducible errors being reported to first-level support.

We saw two possibilities: either couple the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, a class running in a dedicated thread inside the Tomcat application takes updates and applies them in parallel. The waiting time for a backoffice update to reach the frontend is down to 0.5-2 seconds, which greatly reduces the load on first-level support. And it is as loosely coupled as can be; we could even swap in a different indexing engine.
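The answer does not show the implementation of that dedicated update thread, but the idea can be sketched with a BlockingQueue: the business logic enqueues ids, and a background worker drains the queue and applies each update to the index. The class name, the String payload, and the processing step are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: business code submits index updates, and a dedicated
// daemon thread applies them, so the frontend lags by seconds, not minutes.
public class IndexUpdateQueue {

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<String>();
    private final List<String> applied = new ArrayList<String>();

    public IndexUpdateQueue() {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        String id = pending.take(); // blocks until an update arrives
                        // Real code would call writer.updateDocument(...) or
                        // writer.deleteDocuments(...) here.
                        synchronized (applied) {
                            applied.add(id);
                        }
                    }
                } catch (InterruptedException e) {
                    // interrupted: shut the worker down
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public void submit(String id) {
        pending.add(id);
    }

    public List<String> appliedSoFar() {
        synchronized (applied) {
            return new ArrayList<String>(applied);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        IndexUpdateQueue queue = new IndexUpdateQueue();
        queue.submit("42");
        Thread.sleep(200); // give the worker time to drain the queue
        System.out.println(queue.appliedSoFar()); // prints [42]
    }
}
```

Because the queue decouples the producer (business logic) from the consumer (indexer), the search engine behind the worker can be replaced without touching the business code.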



Answer 2:

Take a look at the Solr DataImportScheduler approach.
Basically, when the web application starts, it spawns a separate Timer thread that periodically fires an HTTP POST against Solr, which then uses the DataImportHandler you set up to pull data from an RDBMS (and other data sources).

So, since you're not using Solr, only Lucene, you should take a look at the DataImportHandler source for ideas.