What is the best approach to pull "Delta" data into an analytics DB?

Published 2019-06-14 12:36

Question:

What is the best approach to load only the Delta into the analytics DB from a highly transactional DB?

Note: We have a highly transactional system and we are building an analytics database out of it. At present, we wipe all the fact and dimension tables in the analytics DB and load the entire "processed" data set at midnight. The problem with this approach is that we load the same data again and again every time, along with the little new data that got added/updated on that particular day. We need to load the "Delta" alone (rows which are newly inserted and old rows which got updated). Is there any efficient way to do this?

Answer 1:

It is difficult to say much without knowing the details, e.g. the database schema, the database engine... However, the most natural approach for me is to use timestamps. This solution assumes that the entities (a single record in a table, or a group of related records) that are loaded/migrated from the transactional DB into the analytic one have a timestamp.

This timestamp says when a given entity was created or last updated. When loading/migrating data, you should take into account only those entities whose timestamp > the date of the last migration. The advantage of this approach is that it is quite simple and does not require any specific tool. The question is whether you already have timestamps in your DB.
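To illustrate, a minimal sketch of that timestamp-based extraction for SQL Server, assuming a hypothetical source table dbo.Orders with an updated_at column and a small control table (etl.load_log, also invented here) that remembers when the last load ran:

    -- Assumed control table, one row per source table:
    --   CREATE TABLE etl.load_log (table_name sysname PRIMARY KEY, last_loaded_at datetime2 NOT NULL);

    DECLARE @last_load datetime2 =
        (SELECT last_loaded_at FROM etl.load_log WHERE table_name = 'dbo.Orders');
    DECLARE @this_load datetime2 = SYSUTCDATETIME();

    -- Pull only the rows created or updated since the previous load
    SELECT o.*
    FROM dbo.Orders AS o
    WHERE o.updated_at >  @last_load
      AND o.updated_at <= @this_load;      -- upper bound keeps the window stable

    -- After the rows are merged into the analytics DB, move the watermark forward
    UPDATE etl.load_log
    SET last_loaded_at = @this_load
    WHERE table_name = 'dbo.Orders';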

Another approach might be to utilize some kind of change tracking mechanism. For example, MS SQL Server has something like that (see this article). However, I have to admit that I've never used it, so I'm not sure whether it is suitable in this case. If your database doesn't support change tracking, you can try to build it on your own based on triggers, but in general that is not an easy thing to do.
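If the database does turn out to be SQL Server, the built-in Change Tracking feature works roughly like this (a sketch only; the database name, dbo.Orders and its order_id primary key are made up for the example):

    -- One-time setup: enable change tracking on the database and on each source table
    ALTER DATABASE MySourceDb
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Orders
        ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

    -- On each load: fetch everything that changed since the version seen last time
    DECLARE @last_sync_version bigint = 0;    -- persisted by the ETL job between runs
    DECLARE @current_version   bigint = CHANGE_TRACKING_CURRENT_VERSION();

    SELECT ct.SYS_CHANGE_OPERATION,           -- I = insert, U = update, D = delete
           ct.order_id,                       -- primary key of the tracked table
           o.*
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
    LEFT JOIN dbo.Orders AS o ON o.order_id = ct.order_id;
    -- store @current_version as the new @last_sync_version for the next run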



Answer 2:

We need to load the "Delta" alone (rows which are inserted newly & the old rows which got updated). Any efficient way to do this?

You forgot rows that got deleted. And that is the crux of the problem. Having an updated_at field on every table and polling for rows with updated_at > @last_poll_time works, more or less, but polling like this does not give you a transactionally consistent image, because each table is polled at a different moment. Tracking deleted rows adds complications at the app/data-model layer, as rows have to be either logically deleted (is_deleted) or moved to an archive table (for each table!).
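A minimal sketch of the logical-delete variant, assuming a hypothetical dbo.Orders table where deletes are turned into an is_deleted flag so they surface in the same updated_at-based delta query:

    -- Instead of DELETE, the application (or an INSTEAD OF DELETE trigger) flags the row,
    -- so the deletion is visible to the same polling query as inserts and updates.
    UPDATE dbo.Orders
    SET is_deleted = 1,
        updated_at = SYSUTCDATETIME()
    WHERE order_id = @order_id;

    -- The delta query carries the flag into the analytics DB,
    -- where the corresponding fact rows can be removed or marked inactive.
    SELECT order_id, customer_id, amount, is_deleted, updated_at
    FROM dbo.Orders
    WHERE updated_at > @last_poll_time;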

Another solution is to write triggers in the database: attach a trigger to each table and have the trigger write the changes that occurred into a table_history table. Again, one for each table. These solutions are notoriously difficult to maintain long term in the presence of schema changes (columns added or modified, tables dropped, etc.).
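A rough sketch of that trigger-per-table idea in T-SQL (table and column names invented for the example); every insert, update and delete on dbo.Orders leaves a row in dbo.Orders_history for the ETL job to pick up:

    CREATE TABLE dbo.Orders_history (
        change_id  bigint IDENTITY PRIMARY KEY,
        order_id   int       NOT NULL,
        operation  char(1)   NOT NULL,                         -- 'I', 'U' or 'D'
        changed_at datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
    );
    GO

    CREATE TRIGGER dbo.trg_Orders_track
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Rows only in "inserted" are inserts, rows only in "deleted" are deletes,
        -- rows present in both are updates.
        INSERT INTO dbo.Orders_history (order_id, operation)
        SELECT COALESCE(i.order_id, d.order_id),
               CASE WHEN d.order_id IS NULL THEN 'I'
                    WHEN i.order_id IS NULL THEN 'D'
                    ELSE 'U' END
        FROM inserted AS i
        FULL OUTER JOIN deleted AS d ON d.order_id = i.order_id;
    END;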

But there are database-specific solutions that can help. For instance, SQL Server has Change Tracking and Change Data Capture. These can be leveraged to build an ETL pipeline that maintains an analytical data warehouse. Database schema changes are still a pain, though.
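For completeness, a hedged sketch of what wiring up Change Data Capture on SQL Server looks like (dbo.Orders is again an invented example table; dbo_Orders is the default capture instance name CDC generates for it):

    -- One-time setup: enable CDC on the database and on the source table
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Orders',
         @role_name     = NULL;

    -- Each ETL run reads the changes recorded between two log sequence numbers
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
    -- __$operation: 1 = delete, 2 = insert, 3/4 = before/after image of an update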

There is no silver bullet, no pixie dust.