Relational vs. Dimensional Databases, what's t

2019-03-08 05:24发布

问题:

I'm trying to learn about OLAP and data warehousing, and I'm confused about the difference between relational and dimensional modeling. Is dimensional modeling basically relational modeling, but allowing for redundant/un-normalized data?

For example, let's say I have historical sales data on (product, city, # sales). I understand that the following would be a relational point-of-view:

Product | City | # Sales
Apples, San Francisco, 400
Apples, Boston, 700
Apples, Seattle, 600
Oranges, San Francisco, 550
Oranges, Boston, 500
Oranges, Seattle, 600

While the following is a more dimensional point-of-view:

Product | San Francisco | Boston | Seattle
Apples, 400, 700, 600
Oranges, 550, 500, 600

But it seems like both points of view would nonetheless be implemented in an identical star schema:

Fact table: Product ID, Region ID, # Sales
Product dimension: Product ID, Product Name
City dimension: City ID, City Name

And it's not until you start adding some additional details to each dimension that the differences start popping up. For instance, if you wanted to track regions as well, a relational database would tend to have a separate region table, in order to keep everything normalized:

City dimension: City ID, City Name, Region ID
Region dimension: Region ID, Region Name, Region Manager, # Regional Stores

While a dimensional database would allow for denormalization to keep the region data inside the city dimension, in order to make it easier to slice the data:

City dimension: City ID, City Name, Region Name, Region Manager, # Regional Stores

Is this correct?

回答1:

A star schema really lies at the intersection of the relational model of data and the dimensional model of data. It's really a way of starting with a dimensional model, and mapping it into SQL tables that somewhat resemble the SQL tables you get if you start from a relational model.

I say somewhat resemble because many relational design methodologies result in a normalized design, or at least a nearly normalized design. A star schema will have significant departures from full normalization.

Every departure from full normalization carries with it a consequent data update anomaly. (I'm including anomlaies on insert, update and delete operations under one umbrella). Those anomalies don't have anything to do with what data model you started with.

The comment on OLTP versus OLAP is relevant here. Update anomalies will have different impacts on performance and/or programming difficulty in those two situations.

In addition to a star schema in an SQL databaase, there are dimensional database products out there that store data in a physical form that is unique to that product. With those products, you don't see a star schema so much as you see a direct implementation of the dimensional model, and an interface that might be peculiar to the product. Some of those interfaces allow OLAP operations to be completely point-and-click.

Just as a digression from your question, I once built a star schema as an intermediate step between an OLTP database that supported a transaction based application and a datacube inside Cognos PowerPlay. Using standard ETL techniques, the combined transfer from the OLTP database to the star schema and then from the star schema to the data cube actually outperformed the direct transfer from the OLTP database to the datacube. This was an unexpected result.

Hope this helps.



回答2:

In simple words OLTP normalized database are designed with most optimal "transactional" point of view. Databases are normalized to work optimally to a transactional system. When I say optimization of transactional system i mean ..getting to a design state of database structure where all transactional operations like delete,insert,update and select are balanced to give equal or optimum importance to all of them at any point of time...as they are equally valued in a transactional system.

And that what a normalized system offer ..minimal updates possible for a data update,minimal insert possible for new entry,one place delete for category deletion etc (e.g. new product category )...all this is possible a we branch a create master tables .....but this comes at the cost of "select" operation delay ..but as I said its(normalization) not most efficient model for all operations ..its "Optimal"...having said we get other methods to enhance data fetching speed..like indexing etc

On the other hand Dimensional model (mostly used for data-ware house design)..meant for giving importance to only one kind of operations thats Selection of data...as in data-ware houses ..data update/insertion happens periodically ..and its a one time cost.

So if one try to tweak normalized data structure so that only selection is the most important operation at any point in time ...we will end up getting a denormalized (I would say partially denormalized)..dimensional star structure.

  • all the foreign keys a one place Fact -no dimension to dimension join (i.e. master to master table join)..snowflake represent same dimension
    • ideally designed facts carry only numbers ..measures or foreign keys
    • dimension are used to carry description and non aggregatable info
    • redundancy of data is ignored ...but in rare cases if Dimensions itself grow too much .snowflake design is seen as option..but that still is avoidable

For details please go through detailed books on this topic.



回答3:

I have just recently read up on the difference between Dimensional and Relational Data Modeling since we primarily use Relational models at my business where we store an Enterprise Data Warehouse (EDW).

According to Steve Hoberman in his book "Data Modeling Made Simple" the distinction between the 2 types of models is this:

  • Relational Data Models captures the business solution for how part of the business works, a.k.a business process
  • Dimensional Data Models capture the details the business needs to answer questions about how well it is doing

It can be argued that a relational model can also be used as a foundation upon which to answer business questions, but at a tactical level. "How many orders are in an unfulfilled state for customer x due to credit hold?" But the distinction is that of where the reporting question needs the 'native grain' of the table and when the reporting question can be answered with summarized data.

In your above 2 examples they are actually both examples of Dimensional data modeling since neither of the 2 tables are storing the Sales Order at its 'native grain', and therefore does not capture the business process of creating a sales order. The only difference between the 2 tables is that in the 2nd table the city dimension has been transposed into the fact table.



回答4:

I found the description I found on http://www.orafaq.com/node/2286 to be very helpful when coming to star schema's from a relational perspective.

Consider a fully normalized data model. Now think of exactly the opposite, where you fully denormalize your relational data model so that you have only one flat record like a big'ol spreadsheet with a very wide row. Now back up from this flat record just a little bit so that you have a data model that is only two levels deep; one big table, and several small tables that the big table points back to. This is a STAR schema. Thus a true star data model has two attributes, it is always two levels deep, and a true star model always contains only one large table that is the focus of the model.