Aim
To synchronize my elasticsearch server with new and expired data in my SQL database
Issue
There are two very different ways I can achieve this and I don't know which is better. I can either pull information to elasticsearch with a direct connection to the SQL database using the JDBC river plugin. Alternatively I can push data to elasticsearch using the PHP client using the code shown below as an example:
// The Id of the document
$id = 1;
// Create a document
$tweet = array(
'id' => $id,
'user' => array(
'name' => 'mewantcookie',
'fullName' => 'Cookie Monster'
),
'msg' => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
'tstamp' => '1238081389',
'location'=> '41.12,-71.34',
'_boost' => 1.0
);
// First parameter is the id of document.
$tweetDocument = new \Elastica\Document($id, $tweet);
// Add tweet to type
$elasticaType->addDocument($tweetDocument);
// Refresh Index
$elasticaType->getIndex()->refresh();
I was going to have a cron run every thirty minuets to check for items in my database that not only have an "active" flag but that also do not have an "indexed" flag, that means I need to add them to the index.
QUESTION
Seeing as I have two methods to synchronize data between elasticsearch and mysql in two different ways, what is the advantage and disadvantage of each option. Is there a specific usecase which defines using one over the other?
I would use the river method.
Advantages of the river:
Advantages of your solution:
Disadvantages of your solution:
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
If you forget for a moment that you need to import initial data into Elasticsearch, I would use an event system to push data to Elasticsearch. This is more efficient in the long run.
Your application knows exactly when something needs to be indexed by Elasticsearch. To take your tweet example, at some point a new tweet will enter your application (a user writes one for example). This would trigger a
newTweet
event. You have a listener in place that will listen to that event, and store the tweet in Elasticsearch whenever such an event is dispatched.If you don't want to use resources/time in the web request to do this (and you definitely don't want to do this), the listener could add a job to a queue (Gearman or Beanstalkd for example). You would then need a worker that will pick that job up and store the tweet in Elasticsearch.
The main advantage is that Elasticsearch is kept up-to-date more real-time. You won't need a cronjob that would introduce a delay. You'll (mostly) handle a single document at a time. You won't need to bother the SQL database to find out what needs to be (re)indexed.
Another advantage is that you can easily scale when the amount of events/data gets out of hand. When Elasticsearch itself needs more power, add servers to the cluster. When the worker can't handle the load, simply add more of them (and place them on dedicated machines). Plus your webserver(s) and SQL database(s) won't feel a thing.
I would use the river method even thought a in house build solution might be more customizable.
On one side, the jdbc-river plugin is a plugin that is already built and it has around 20 contributors so far. So you kind have a extra team working to improve that tool along the way elasticsearch itself is improving.
All you'll have is to install it and you don't even need a complexed configuration to set a river between your cluster and your relational database.
Another advantage with the jdbc-river solution is that you don't need to deal with memory management. The plugin can operate as a river in "pull mode" or as a feeder in "push mode". In feeder mode, the plugin runs in a separate JVM and can connect to a remote Elasticsearch cluster. I personally prefere the river mode because in this case Elasticsearch would deal with the indexing and memory management issues.
The relational data is internally transformed into structured JSON objects for the schema-less indexing model of Elasticsearch documents.
Both ends are scalable. The plugin can fetch data from different RDBMS source in parallel, and multithreaded bulk mode ensures high throughput when indexing to Elasticsearch.
One of the drawbacks of this solution is that it doesn't notify when it's done indexing. As a solution for that I suggest that you use the Count API to compare results.
Another drawback of the river is that it doesn't pull on update, it just does on insert or delete. I'm referring of course the sql actions UPDATE, INSERT and DELETE.
On second hand, your solution might bring some advantages and drawbacks you might want to consider.
You solution is highly customizable, so you can manage your scripts however you want. But considering the current state of any PHP Elasticsearch client available (Official Elasticseach-php Client , Elastica or FOSElasticaBundle), and even thought the guys are doing a great job on them, it's still considered as a not very mature APIs to work with on that level comparing to the official Elasticsearch JAVA API used for the river.
You should also consider dealing with all the errors that can throw your cluster at you from memory loss, management, performance, etc.
Ex: I tried to build a Proof of Concept using the Elastica API pushing my data from my database to my cluster, with a configuration of 32g RAM, 8 cores running @2.05GHz each, in a test environment, without getting into much details. It took me 5 hours to push 10M records from the database to the cluster. Where as with the river, it takes 20 minutes for the same records. Of course there might be optimizations that can be done around my code but I've consider it more time-consuming that it can bring me.
So, as long as you can customize the river according to your needs, use it. If the river doesn't support something you want to do, then you can stick to your own solution.
NB: Of course there might be other point you might want to consider but this subject is quite long to discuss over here. So I chose some point, I found essential that you should be aware of.