Nulls in dimension table for numeric attributes

2019-05-28 13:05发布

问题:

What is the best way to handle missing values in a dimension table?

In the case of a textual column, it is easy to write "NA: Missing," but what should be done for numeric columns where it is important to retain the specific values. Note: I do not want a solution that uses banded values (e.g., textual columns for "0-50", "50-100", "NA: Missing").

For instance, a customer dimension may have a year-of-birth. How should missing years of birth be handled? Leave it null? Add in an arbitrary number as a placeholder such as 1900?

Sometimes, it may be difficult to find a placeholder number. For instance, if sales-to-date are non-negative, but can be zero I wouldn't want to put "0" as a placeholder for null. I could use negative values such as "-1", but that would ruin queries that use sums.

回答1:

In your fact table you never use a null value for a foreign key, but you can and should use null values for the metrics where appropriate. A null value will give accurate results when aggregated, where a default value will not.

In dimension tables also the attributes can and should be null where appropriate, for the same reason. While it's less common to do aggregation of the dimension values, it does happen, so it should be right when it happens.

If you have a need for an empty value in a dimension then the dimension should have a row for the purpose. The Date Dimension for instance might have 3 or 4 special rows - no value, unknown, past and future are reasonable special value rows, depending on your needs.

You will save yourself a lot of pain and suffering in the BI layer this way.