Updating Elastic Search index when SQL Server tabl

2019-09-02 12:45发布

问题:

What is the best way to know when an Elastic Search index needs updating, assuming the upstream data source is transactional SQL Server tables with inserts, updates and deletes?

Examples: Tables Parent, Child, Grandchild.

Parent      |  Child                | Grandchild
ID   Name   |  ID  ParentID  Name   | ID  ChildID Amount
1    Foo    |  10   1         Bike  | 100 10      5
2    Bar    |  20   1         Car   | 200 20      2
3    Baz    |  30   3         Tran  | 300 30      1

Grandchild is updated, and Elastic Search index on Parent needs to be updated for the associated record.

So, on Grandchild updated, I need to find the Parent.ID for that Grandchild. Which means joining to Child and getting the ParentID value.

At the same time, we are starting a incremental, iterative loaded data warehouse initiative, so ideally I would like to use the same SQL Server API/technique for both.

Based on comments in How to notify a windows service(c#) of a DB Table Change(sql 2005)? by Remus Rusanu, Query Notification API should not be used, as its only intended use is cache invalidation, not change tracking...

Which seems to leave two options - SQL Server Change Data Capture, and SQL Server Change Tracking API.

We thought about doing all change tracking in the application level, but our main concerns are out-of-band updates due to the fact some data needs to be updated over night in unforeseen ways due to new government regulations, and so we really need a way to capture changes at the table-level and bubble it up into a queue to feed Elastic Search.

Thank you!

回答1:

The appropriate API for this is either change tracking or change data capture. Which one depends on the frequency/volume of data changes and on how big latency you can afford between the original data and the search index. For low latency and frequent changes CDC is better imho, as it can give you a 'delta' with minimal cost. For slow changing data and an infrequent Elastic Search index refresh maybe I would prefer CT as is more lightweight, although figuring out the 'delta' is more complex (I say maybe because in general I found CDC better suited than CT over long term solutions, as requirements evolve CDC ends up as a better fit).

The usual problem with tracking changes is finding out what was deleted. In house solutions, based on triggers or implemented in app layer, always have problems with that part. Is not impossible to do it, but you'll end up re-implementing CT/CDC on your own, without access to the internals of SQL log parsing and extra update logging that CDC leverages...



回答2:

This guy as in interesting solution using triggers, the built-in ServiceBroker to queue the changes and a C# service to read that queue and push changes to elastic search: https://medium.com/@mindingdata/elasticsearch-realtime-rivers-with-mssql-server-e1540a9bf1d3#.72k9buet5

the architecture is like a CDC but using service broker to store the changes instead of the CDC tables