Benefits of using Staging Database while designing

2020-05-19 04:10发布

问题:

I am in process of designing a Data Warehouse Architecture. While exploring various options to Extract data from Production and putting into Data Warehouse, I came across many articles which mainly suggested following two approaches -

  1. Production DB ----> Data Warehouse (Star Schema) ----> OLAP Cube
  2. Production DB ----> Staging Database ----> Data Warehouse (Star Schema) ----> OLAP Cube

I am still not sure which one is the better approach in terms of Performance and reducing processing load on Production database.

Which approach you find better while designing Data Warehouse ?

回答1:

Below points are taken from, DWBI Organization's article

Staging area may be required if you have any of the following scenarios:

  1. Delta Loading: Your data is read incrementally from the source and you need an intermediate storage where incremental set of your data can be stored temporarily for transformation purpose
  2. Transformation need: You need to perform data cleansing, validation etc. before consuming the data in the warehouse
  3. De-coupling: Your processing takes lot of time and you do not want to remain connected to your source system (presumably the source system is being constantly used by the actual business users) during the entire time of your processing and, hence, prefer to just read the data from source system in one-go, disconnect from the source and then continue processing the data at your "own side"
  4. Debugging purpose: You need not go back to your source all the time and you can troubleshoot issues (if any) from staging area alone
  5. Failure Recovery: Source system may be transitory and the state of the data may be changing. If you encounter any upstream failure, you may not be in a position to re-extract your data as source has changed by that time. Having a local copy helps

Performance and reduced processing may not be only considerations. Adding a staging may sometimes increase latency (i.e. time delay between occurrence of a business incidence and it's reporting). But I hope above points will help you to make a better judgement.



回答2:

ETL = Extract, Transform and Load. Staging database's help with the Transform bit. Personally I always include a staging DB and ETL step.

A Staging database assists in getting your source data into structures equivalent with your data warehouse FACT and DIMENSION destinations. It also decouples your warehouse and warehouse ETL process from your source data.

If your data warehouse destination tables pretty much map your production DB tables with only some additional dimension fields then you could get away with ignoring the Staging Database. This will save you a little development time. I don't recommend this as you:

  1. Will end up tying your data warehouse solution directly to your source database
  2. Will most likely end up with a very complicated ETL step
  3. May end up with race conditions/orphan records due to changes in your source database during the ETL process
  4. Data Warehousey people may make 'hrumph' type sounds at you

Most likely though you will be performing some sort of data manipulation (converting dates to DATE_DIM keys, aggregating values) in which case a staging DB will help you separate your transformation logic and calculations from your data warehouse operations (dimensioning data).

You may have also come across this sort of pattern:

[PROD DB] -(ETL)->  [RAW DB] -(ETL)-> [STAGING DB] -(ETL)-> [DW DB]  -(ETL)-> [DM DB]

which if performance considerations are important you may want to look at. In your case the RAW_DB could be an exact 1:1 copy of your production database and the ETL step that creates it might just be a recreate the DB from the most recent nightly backup. (Traditionally RAW_DB was used for getting data from various external sources with each field as pure text, these fields where then converted to their expected data type with exceptions handled as encountered. This is not so much of a problem when you have one source and its a nice strongly typed normalized database)

From this RAW_DB the next ETL process would truncate and populate staging such that the STAGING DB contains all the new/updated records that are going into the warehouse.

Another added benefit of all these steps is that it really assists in debugging weird data as for any given run you can see record values inside each of the difference databases and identify which ETL process is introducing the sadness.



回答3:

There are a few potential advantages of using an intermediary staging database, which may or may not apply to your situation. There is no perfect, one-size fits all solution. Some of the potential advantages include:

  • If it is appropriate, you can take a snapshot of your production database (you may have a daily backup or hot-site snapshot already) and then do your ETL from the restored backup or snapshot. This could save load on your production database.
  • You may need complicated processing for your ETL which requires many intermediate tables which have no use except for the ETL process. You may not want to clutter your data warehouse with these intermediary tables.
  • Your raw data may not be available all at once and you need somewhere to accumulate it before starting your ETL process to build your data warehouse.
  • Your data warehouse may have production window requirements which can't be met by your ETL and so you need to stage your "output" (i.e. new records for the data warehouse) rather than or in addition to your production database.
  • The production system may be in a highly secured environment and for whatever reason a decision may have been made to not allow the ETL process full access to the raw production data. The group that controls the production database may want to extract only the necessary data to a staging database so that the ETL process can only see what it needs. I've seen this where the production system and ETL process are managed by different third-party vendors.
  • It may be that your ETL process creates large intermediate tables. Sometimes space management is easier if you start with an empty model database for your ETL staging area and then "throw it away" each day rather than trying to recover the space in a more surgical way, as you might do with a production or reporting database.

There are possible disadvantages too, which may or may not matter to you. Chief among these is having to have another database server. A lot of the advantages could be meaningless if you are using the same server to host the production and/or data warehouse databases.



回答4:

Really staging area is not a necessity if we can handle it on the fly. But can we? Here are a few reasons why you can’t avoid a staging area: 1. Source systems are only available for extraction during a specific time slot which is generally lesser than your overall data loading time. It’s a good idea to extract and keep things at your end before you lose the connection to the source systems. 2. You want to extract data based on some conditions which require you to join two or more different systems together. E.g. you want to only extract those customers who also exist in some other system. You will not be able to perform a SQL query joining two tables from two physically different databases. 3. Various source systems have different allotted timing for data extraction. 4. Data warehouse’s data loading frequency does not match with the refresh frequencies of the source systems 5. Extracted data from the same set of source systems are going to be used in multiple places (data warehouse loading, ODS loading, third-party applications etc.) 6. ETL process involves complex data transformations that require extra space to temporarily stage the data 7. There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations

Clearly staging area gives lot flexibility during data loading. Shouldn’t we have a separate staging area always then? Is there any impact of having a stage area? Yes there are a few. 1. Staging area increases latency – that is the time required for a change in the source system to take effect in the data warehouse. In lot of real time / near real time applications, staging area is rather avoided Data in the staging area occupies extra space 2. To me, in all practical senses, the benefit of having a staging area outweighs its problems. hence, in general I will suggest designating a specific staging area in data warehousing projects.