Design a dimension with multiple data sources

2019-09-02 07:47发布

问题:

I am designing a few dimensions with multiple data sources and wonder what other people have done to handle the multiple surrogate business keys.

My Example: I have 2 data sources - the Ordering System and the Execution System. The Ordering system has details about payment and what should happen; the Execution System has details on what actually happened (how long it took etc, who executed on the order). Data from both systems is need to created a single fact.

In both the Ordering and Execution system they is a Location table. The business keys from both systems are mapped via an esb . There are attributes in both systems that make up the complete picture about a single location. Billing information is in the Ordering system, latitude and longitude are in the Execution system. And Location Name exists in both systems.

How do you design a SCD accomodate changes from both systems to the dimension?

We follow a fairly strict Kimball methodology - fyi, but I am open to looking at everyone's solutions.

回答1:

Not necessarily an answer but here are my thoughts:

You've already covered the real options in your comment. Either:

A. Merge it beforehand

You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference to be stored in addition to normal DW data

OR

B. Merge it in the dimension

Put both records in the dimension and allow the reporting tool to 'merge' it by, for example, grouping by location name. This means you don't need prior merging logic you just dump it in the dimension

However you have two constraints that I feel makes the choice between A & B clearer

Firstly, you need an SCD (Type 2 I assume). This means Option B could get very complicated as when there is a change in one source record you have to go find the the other record and change it as well - very unpleasant for option B. You still need some kind of pre-stored key to link them, which means option B is no longer simple

Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match

So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.

You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.



回答2:

My thinking is actually very trivial. First you need to be able to conclude what is your master dataset on Geo+Location and granularity.

My method will be:

DIM loading

Say below is my target

Dim_Location = {Business_key, Longitude, Latitude, Location Name}

Dictionary

Business_key = Always maps to master record from source system (in this case it is the execution system). Imagine now the unique key from business is combined (longitude, latitude) for this table.

Location Name = Again, since we assume the "Execution system" is master for our data then it will host from Source="Execution System".

The above table is now loaded for Fact lookup.

Fact Loading

You have already integrated record between execution system and billing system. It's a straight forward lookup and load in staging since it exists with necessary combination of geo_location.

Challenging scenarios

What if execution system has a late arriving records on orders? What if same geo_location points to multiple location names? Not possible but worth profiling the data for errors.