Star schema, normalized dimensions, denormalized h

2019-01-25 08:13发布

问题:

Given the following star schema tables.

  • fact, two dimensions, two measures.

#   geog_abb  time_date amount     value
#1:       AL 2013-03-26  55.57 9113.3898
#2:       CO 2011-06-28  19.25 9846.6468
#3:       MI 2012-05-15  94.87 4762.5398
#4:       SC 2013-01-22  29.84  649.7681
#5:       ND 2014-12-03  37.05 6419.0224
  • geography dimension, single hierarchy, 3 levels in hierarchy.

#   geog_abb  geog_name geog_division_name geog_region_name
#1:       AK     Alaska            Pacific             West
#2:       AL    Alabama East South Central            South
#3:       AR   Arkansas West South Central            South
#4:       AZ    Arizona           Mountain             West
#5:       CA California            Pacific             West
  • time dimension, two hierarchies, 4 levels in each.

#    time_date time_weekday time_week time_month time_month_name time_quarter time_quarter_name time_year
#1: 2010-01-01       Friday         1          1         January            1                Q1      2010
#2: 2010-01-02     Saturday         1          1         January            1                Q1      2010
#3: 2010-01-03       Sunday         1          1         January            1                Q1      2010
#4: 2010-01-04       Monday         1          1         January            1                Q1      2010
#5: 2010-01-05      Tuesday         1          1         January            1                Q1      2010

Examples is stripped of surrogate keys to improve readability. In results there are levels in hierarchy without other attributes, just don't bother that, they are still levels in hierarchy.


In star schema expressed as:

         GEOGRAPHY (all fields)
        /
       /
   FACT
       \
        \
         TIME (all fields)

In snowflake schema expressed as:

                    geog_region_name
                   /
                  geog_division_name
                 /
                geog_abb (+ geog_name)
               /
              /
          FACT
              \
               \
                time_date
                   |
hierarchies:       |
        weekly    / \    monthly
                 /   \ 
                /     \
   time_weekday         time_month (+ time_month_name)
            |             |
            |             |
        time_week     time_quarter (+ time_quarter_name)
            |             |
            |             |
        time_year      time_year

How would you call the following schema

Does it have any specific name? Starflake? :)

        |>-- geog_region_name
        |
        |>-- geog_division_name
        |
        |>-- geog_abb (+ geog_name)
        |
        |
        geography base
       /
      /
  FACT
      \
       \
        time base
        |
        |
        |>-- time_date
        |
        |>-- time_weekday
        |
        |>-- time_week
        |
        |>-- time_month (+ time_month_name)
        |
        |>-- time_quarter (+ time_quarter_name)
        |
        |>-- time_year

It basically has a dimension base table storing identities of every level of every hierarchy within a dimension. No need for recursive walk through snowflake's levels, potentially less joins. Data still well normalized, only keys are denormalized into base table. All levels from all hierarchies tied to lowest grain key of a dimension in dimension base.
Additionally having a dimension base table allows to handle time variant attributes/temporal queries just in that table, at the granularity of a hierarchy level.

Here is the tabular representation.

Still on natural keys!

  • fact

#    geog_abb  time_date amount     value
# 1:       AK 2010-01-01 154.43 12395.472
# 2:       AK 2010-01-02  88.89  6257.639
# 3:       AK 2010-01-03  81.74  7193.075
# 4:       AK 2010-01-04 165.87 11150.619
# 5:       AK 2010-01-05   8.75  6953.055
  • time dimension base

#     time_date time_year time_quarter time_month time_week time_weekday
# 1: 2010-01-01      2010            1          1         1       Friday
# 2: 2010-01-02      2010            1          1         1     Saturday
# 3: 2010-01-03      2010            1          1         1       Sunday
# 4: 2010-01-04      2010            1          1         1       Monday
# 5: 2010-01-05      2010            1          1         1      Tuesday
  • time dimension normalization to hierarchy levels

#    time_year
# 1:      2010
# 2:      2011
# 3:      2012
# 4:      2013
# 5:      2014

#    time_quarter time_quarter_name
# 1:            1                Q1
# 2:            2                Q2
# 3:            3                Q3
# 4:            4                Q4

#    time_month time_month_name
# 1:          1         January
# 2:          2        February
# 3:          3           March
# 4:          4           April
# 5:          5             May

#    time_week
# 1:         1
# 2:         2
# 3:         3
# 4:         4
# 5:         5

#    time_weekday
# 1:       Friday
# 2:       Monday
# 3:     Saturday
# 4:       Sunday
# 5:     Thursday

#     time_date time_week time_weekday time_year
# 1: 2010-01-01         1       Friday      2010
# 2: 2010-01-02         1     Saturday      2010
# 3: 2010-01-03         1       Sunday      2010
# 4: 2010-01-04         1       Monday      2010
# 5: 2010-01-05         1      Tuesday      2010
  • geography dimension base

#    geog_abb geog_region_name geog_division_name
# 1:       AK             West            Pacific
# 2:       AL            South East South Central
# 3:       AR            South West South Central
# 4:       AZ             West           Mountain
# 5:       CA             West            Pacific
  • geography dimension normalization to hierarchy levels

#    geog_region_name
# 1:    North Central
# 2:        Northeast
# 3:            South
# 4:             West

#    geog_division_name
# 1: East North Central
# 2: East South Central
# 3:    Middle Atlantic
# 4:           Mountain
# 5:        New England

#    geog_abb  geog_name geog_division_name geog_region_name
# 1:       AK     Alaska            Pacific             West
# 2:       AL    Alabama East South Central            South
# 3:       AR   Arkansas West South Central            South
# 4:       AZ    Arizona           Mountain             West
# 5:       CA California            Pacific             West

Dimension base could store also primary key's attributes, this would de-duplicate dimension's lowest level but will be less consistent (time_date levels from both hierarchies would fit into time dimension base tables).


What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity.
Does it have any name? It is being use? If not why?

回答1:

You are building a snowflake schema with shortcuts.

It's used and BI tools can easily use the shortcuts.

You can also have shortcuts from a parent level of a dimension to a fact table at child level for that dimension. It works, you can skip a join, but you need to store an additional column in the fact table.

The only concern is about data integrity, if a parent-child relationship changes you need to update not only the child table, but also all other tables where this relationship is stored.

It's not a big deal if you generate every time your dimension table from your normalize data, but you need to be careful, even more if you store a parent ID in the fact table.



回答2:

What you are doing is not a snowflake schema ...it is similar to "Data Vault" and our own variation "Link-Model". It essentially creates link tables just containing keys which sit between Fact tables and Dim tables (and other Dim tables). Although, we describe them as entity tables and measure tables.

The advantages are

  • You can parallel load dimension and fact tables, then populate the link tables
  • Complicated practices like "as at reporting" with "Adjustments" as found in Insurance can be handled quite readily
  • It is more intuitive to split slowly and quickly changing dimension Dimension tables that are just linked by the link tables. This is a time saving.
  • Adding new dimensions to fact tables is fairly simple and quick, after all it is just adding an extra integer column to a table containing just integers.
  • Factless facts are far more intuitive than in a conventional schema. You can create relationships between dimensions, without any fact record.

The downsides are

  • A slightly more complicated schema structure, so we generally create a Kimball models on top of the "Link Model", as business users tend to understand it well.
  • To add a new dimension to a fact table or to extend a dimension table can be easily done, but the schema can become cluttered over time.