I am using DataImportHandler for indexing data in SOLR. I used full-import to index all the data in the my database which is around 10000 products.Now I am confused with the delta-import usage? Does it index the new data added into the database on interval basis i mean it is going to index the new data added to my table around 10 rows or it just updates the changes in the already indexed data.
Can anyone please explain it to me with simple example as soon as you can.
The DataImportHandler can be a little daunting. Your initial query has loaded 10.000 unique products. This is loaded if you specify /dataimport?command=full-import.
When this import is done, the DIH stores a variable ({dataimporter.last_index_time}) which is the last date/time you did this import.
In order to do an update, you specify a deltaQuery. The deltaQuery is meant to identify the records that have changed in your database since the last update. So, you specify a query like this: SELECT product_id
FROM sometable
WHERE [date_update] >= '${dataimporter.last_index_time}'
This will retrieve all the product_ids from your database that are updated since you last full update. The next query (deltaImportQuery) you need to specify is the query that will retrieve the full record for each product_id that you have from the previous step.
Assuming product_id is you unique key, solr will figure out that it needs to update an existing record, or add one if the product_id doens't work.
In order to execute the deltaQuery and the deltaImportQuery you use /dataimport?command=delta-import
This is a great simplification of all the possibilities, check the Solr wiki on DataImportHandler, it is a VERY powerful tool!
On another note:
When you use a delta import within a small time window (like a couple of times in a few seconds) and the database server is on an other machine than the solr index service, make sure that the systemtime
of both machines matches, since the timestamp of [date_update]
is generated on the database server and dataimporter.last_index_time
is generated on the other.
Otherwise you won't be updating the index (or too much) depending on the time differences.
I agree that the Data Import Handler can handle this situation. One important limitation to the DIH is that it does not queue requests. The result of this is that if the DIH is "busy" indexing it will ignore all future DIH requests until it is "idle" again. The skipped DIH requests are lost and not executed.