What is the best approach to get from relational OLTP to OLAP cubes?

Posted 2019-03-09 13:25

Question:

I have a fairly standard normalised OLTP database, and I have realised that I need to do some complex queries (averages, standard deviations) across different dimensions in the data.

So I have turned to SSAS and the creation of OLAP cubes.

However, to create the cubes I believe my data source structure needs to be in a 'star' or 'snowflake' configuration (which I don't think it is right now).

Is the normal procedure to use SSIS to run some sort of ETL process from my primary OLTP DB into another relational DB that is in the proper 'star' configuration with facts and dimensions, and then use this DB as the data source for the OLAP cubes?

Thanks

Answer 1:

Yes, that is the basic idea. You take your highly normalized OLTP database and de-normalize it into cubes for the purpose of slicing and dicing the data and then presenting reports on it. The logical design technique is called dimensional modeling. There is a ton of great information about dimensional modeling over at the Kimball Group, and Ralph Kimball's books on the subject are also excellent. If you want to learn more about the BI tools themselves, check out the virtual labs on SSIS, Analysis Services and more.
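
To make the shape concrete, here is a minimal T-SQL sketch of a star schema. The subject area, table and column names are hypothetical, not taken from the question's database; the point is one fact table of additive measures surrounded by flattened dimension tables.

    -- Hypothetical sales star schema: names are illustrative only.
    CREATE TABLE DimDate (
        DateKey      int         NOT NULL PRIMARY KEY,  -- e.g. 20190309
        [Date]       date        NOT NULL,
        [MonthName]  varchar(20) NOT NULL,
        CalendarYear smallint    NOT NULL
    );

    CREATE TABLE DimCustomer (
        CustomerKey         int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        CustomerBusinessKey int           NOT NULL,         -- key from the OLTP system
        CustomerName        nvarchar(100) NOT NULL,
        City                nvarchar(50)  NULL,
        Country             nvarchar(50)  NULL              -- flattened, not a separate table
    );

    CREATE TABLE FactSales (
        DateKey     int   NOT NULL REFERENCES DimDate (DateKey),
        CustomerKey int   NOT NULL REFERENCES DimCustomer (CustomerKey),
        Quantity    int   NOT NULL,
        SalesAmount money NOT NULL   -- additive measures the cube aggregates
    );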



Answer 2:

The answer is: yes, but.

A dimension in SSAS has relationships between attributes that can be used as a series of fields to filter or slice by. These relationships can be hierarchical (more than one level deep: one attribute can have a parent and children). You can also establish drill-down paths (called hierarchies in SSAS) that act like attributes but have a guided drilldown.

In order to do this you need to have keys available in the database that live in a strictly hierarchical relationship (i.e. the keys can't have fuzzy relationships where a child can have more than one parent). Note that this isn't the whole story but it's close enough to reality for the moment.

These hierarchies can be constructed from a flat data structure by the system or presented through a snowflake with relationships marked up in the underlying data source view (DSVs are a part of the cube metadata and can be used to massage data in a manner similar to database views).

A snowflake is a 3NF-ish schema (it doesn't strictly have to be 3NF; you can flatten parts of it in practice) that only has 1:M relationships. SSAS also supports a few other dimension structures, such as parent-child dimensions (a recursive self-join on the dimension table) and M:M dimensions (many-to-many relationships, exactly what they sound like). Dimensions of this type are more fiddly but may be useful to you.
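
As an illustration of what strictly hierarchical keys look like (table names are made up), here is a small snowflaked product dimension: each key rolls up to exactly one parent, which is what lets SSAS build the attribute relationships Product -> Subcategory -> Category.

    CREATE TABLE DimCategory (
        CategoryKey  int PRIMARY KEY,
        CategoryName nvarchar(50) NOT NULL
    );

    CREATE TABLE DimSubcategory (
        SubcategoryKey  int PRIMARY KEY,
        SubcategoryName nvarchar(50) NOT NULL,
        CategoryKey     int NOT NULL REFERENCES DimCategory (CategoryKey)  -- 1:M only, never M:M
    );

    CREATE TABLE DimProduct (
        ProductKey     int PRIMARY KEY,
        ProductName    nvarchar(100) NOT NULL,
        SubcategoryKey int NOT NULL REFERENCES DimSubcategory (SubcategoryKey)
    );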

If you have keys in your source data that can have equivalent data semantics to a snowflake then you may be able to populate your cube through a series of database views on your source system that present the underlying data in a sufficiently snowflake-like format to use for cube dimensions (I have actually done this on a couple of occasions). Schemas that make heavy use of synthetic keys are more likely to work well for this.
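
For example, something along these lines, assuming hypothetical OLTP tables Customers, Cities and Countries: the view flattens them into a single customer dimension that the cube can consume.

    -- Sketch of a view over assumed OLTP tables, presented as a dimension.
    CREATE VIEW dbo.vDimCustomer
    AS
    SELECT
        c.CustomerID   AS CustomerKey,   -- OLTP synthetic key reused as the dimension key
        c.CustomerName,
        ci.CityName,
        co.CountryName
    FROM dbo.Customers AS c
    JOIN dbo.Cities    AS ci ON ci.CityID    = c.CityID
    JOIN dbo.Countries AS co ON co.CountryID = ci.CountryID;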

If your vendor or other parties won't let you add views to your source database, you may be able to use a data source view instead. DSVs can have virtual tables called 'named queries' that are populated from a database query.
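
The SQL you put in a named query is the same sort of thing as the view above. For instance, a date dimension derived on the fly from an assumed Orders table, for when you can't create any objects in the source database:

    -- Illustrative named-query SQL: builds a date dimension from order dates.
    SELECT DISTINCT
        CAST(CONVERT(char(8), o.OrderDate, 112) AS int) AS DateKey,   -- yyyymmdd surrogate
        CAST(o.OrderDate AS date)                       AS [Date],
        DATENAME(month, o.OrderDate)                    AS [MonthName],
        YEAR(o.OrderDate)                               AS CalendarYear
    FROM dbo.Orders AS o;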

Fact tables join to dimensions. In SSAS2005+ you can join different fact tables at different grains within a dimension. I wouldn't normally have much use for this in a data warehouse, but this feature might be useful if you're trying to use the source data without having to massage it too heavily.
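
As a rough sketch of what different grains means (names again hypothetical): actual sales recorded per day, quotas recorded only per month, both tied to the same date dimension. In the cube's dimension usage you would set the quota measure group's granularity attribute to the month level.

    CREATE TABLE FactSalesDaily (
        DateKey     int   NOT NULL,   -- joins DimDate at the day level
        CustomerKey int   NOT NULL,
        SalesAmount money NOT NULL
    );

    CREATE TABLE FactSalesQuota (
        CalendarYear smallint NOT NULL,   -- joins DimDate at the month level
        MonthNumber  tinyint  NOT NULL,
        QuotaAmount  money    NOT NULL
    );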

If this doesn't work then you may well have to write an ETL process to populate a star or snowflake schema.

A few provisos:

  1. Cubes can be made to run in a real-time mode where they just issue a query to the underlying data. This has some risk of creating inefficient queries against your source data, so it is not recommended unless you are really confident that you know what you're doing.

  2. Apropos of (1), you probably won't be able to use the cubes as a data source for screens in your application. If you need to calculate averages for something that the user wants to see on a screen, you will probably have to calculate it in a stored procedure behind the screen (see the sketch after this list).

  3. If you do this, set up a replicated database and populate the cube off that. Have this database refresh periodically, so your ETL process can run from an internally consistent data set. If you run from a live database, you risk loading items that depend on records created after the corresponding load process has already run.

    You can have the situation where you run a dimension load, and then new data is entered into the system. When the fact table load runs, it now contains data that is dependent on dimension data that hasn't been loaded. This will break the cube and cause the load process to fail. Batch refreshing a replicated database to run the ETL or cube loads off will mitigate this issue.

    If you don't have the option of a replicated database you can set up more slack policies for missing data.

  4. If your underlying production data has significant data quality issues they will be reflected in the cubes. GIGO.
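
To illustrate point 2, a minimal sketch of the "stored procedure behind the screen" approach, computing an average and standard deviation straight off an assumed OLTP Orders table (names are illustrative):

    CREATE PROCEDURE dbo.usp_GetCustomerOrderStats
        @CustomerID int
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Aggregate directly against the OLTP data so the figure is current,
        -- rather than querying the cube, which is only as fresh as its last process.
        SELECT
            COUNT(*)            AS OrderCount,
            AVG(o.OrderTotal)   AS AvgOrderTotal,
            STDEV(o.OrderTotal) AS StDevOrderTotal
        FROM dbo.Orders AS o
        WHERE o.CustomerID = @CustomerID;
    END;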