Why are NULL values mapped as 0 in fact tables?

Posted 2019-01-28 01:36

Question:

Why are NULL values in the measure fields of fact tables (in dimensionally modeled data warehouses) usually mapped to 0?

Answer 1:

Although you've already accepted another answer, I would say that using NULL is actually a better choice, for a couple of reasons.

The first reason is that aggregates return the 'correct' answer (i.e. the one that users tend to expect) when NULL is present but give the 'wrong' answer when you use zero. Consider the results from AVG() in these two queries:

-- with zero: SUM = 6.0, AVG = 1.5 (the zero counts as a fourth value)
select SUM(measure), AVG(measure)
from
(
select 1.0 as measure
union all
select 2.0
union all
select 3.0
union all
select 0
) dt

-- with null: SUM = 6.0, AVG = 2 (the NULL row is ignored by both aggregates)
select SUM(measure), AVG(measure)
from
(
select 1.0 as measure
union all
select 2.0
union all
select 3.0
union all
select null
) dt

If we assume that the measure here is "number of days to manufacture item" and NULL represents an item that is still being produced, then zero gives the wrong answer. The same reasoning applies to MIN() and MAX() too: a placeholder zero can be reported as the minimum (or as the maximum, if every real value is negative).
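The MIN()/MAX() point can be checked quickly. The following is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database rather than a real warehouse, and the table name build_times, the column days and the helper aggregates() are made up for illustration.

```python
import sqlite3

def aggregates(values):
    """Return (MIN, MAX, AVG) over a hypothetical 'days' measure column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE build_times (days REAL)")
    conn.executemany(
        "INSERT INTO build_times (days) VALUES (?)",
        [(v,) for v in values],
    )
    row = conn.execute(
        "SELECT MIN(days), MAX(days), AVG(days) FROM build_times"
    ).fetchone()
    conn.close()
    return row

# None is stored as NULL: the unfinished item is ignored by all three aggregates.
with_null = aggregates([1.0, 2.0, 3.0, None])
# 0.0 is a real value: it becomes the minimum and drags the average down.
with_zero = aggregates([1.0, 2.0, 3.0, 0.0])

print(with_null)  # (1.0, 3.0, 2.0)
print(with_zero)  # (0.0, 3.0, 1.5)
```

With the zero placeholder, the report would claim that the fastest item took 0 days to manufacture.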

The second issue is that if zero is a default value, then how do you distinguish between zero as a default and zero as a real value? For example, consider a measure of "shipping charges in EUR" where NULL means that the customer picked up the order himself, so there were no shipping charges, and zero means the order was shipped to the customer for free. You can't use zero to replace NULL without completely changing the meaning of the data. You can obviously argue that the distinction should be clear from other dimensions (e.g. shipping method), but that makes reports more complex and the data harder to understand.
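To make the shipping example concrete, here is a small sketch, again using Python's sqlite3 module with an in-memory database. The orders table, its columns and the four sample rows are invented for illustration; NULL stands for "picked up by the customer" and 0 for "shipped for free".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipping_eur REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, shipping_eur) VALUES (?, ?)",
    [(1, 4.0), (2, 0.0), (3, None), (4, 6.0)],  # order 3 was picked up
)

# With NULL kept: COUNT(column) and AVG skip the picked-up order.
total, shipped, free_shipped, avg_charge = conn.execute("""
    SELECT COUNT(*),
           COUNT(shipping_eur),
           SUM(CASE WHEN shipping_eur = 0 THEN 1 ELSE 0 END),
           AVG(shipping_eur)
    FROM orders
""").fetchone()

# With NULL replaced by 0: the pickup is indistinguishable from free shipping.
zero_free, zero_avg = conn.execute("""
    SELECT SUM(CASE WHEN COALESCE(shipping_eur, 0) = 0 THEN 1 ELSE 0 END),
           AVG(COALESCE(shipping_eur, 0))
    FROM orders
""").fetchone()
conn.close()

print(total, shipped, free_shipped, round(avg_charge, 2))  # 4 3 1 3.33
print(zero_free, zero_avg)                                 # 2 2.5
```

Once the NULL is replaced, the number of "free shipping" orders doubles and the average charge per shipped order drops, even though nothing about the real orders changed.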



Answer 2:

It depends upon what you're modeling, but in general it's to avoid complications with performing aggregates. And in many scenarios it makes sense to treat NULL as 0 for those purposes.

For example, a customer with NULL orders for a given period of time. Or a sales person with NULL sales revenue (shame on him!).
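One way to read the sales example: a salesperson with no sales rows gets NULL from SUM(), and a report may prefer to show 0. The sketch below uses Python's sqlite3 module with an in-memory database; the salesperson and sale tables and their contents are hypothetical. It applies COALESCE at query time, which is one option among several, not necessarily what this answer has in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesperson (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (salesperson_id INTEGER, revenue REAL);
    INSERT INTO salesperson VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO sale VALUES (1, 100.0), (1, 50.0);  -- Bob has no sales
""")

rows = conn.execute("""
    SELECT p.name,
           SUM(s.revenue)              AS raw_revenue,      -- NULL when no rows match
           COALESCE(SUM(s.revenue), 0) AS revenue_or_zero   -- NULL shown as 0
    FROM salesperson AS p
    LEFT JOIN sale AS s ON s.salesperson_id = p.id
    GROUP BY p.id, p.name
    ORDER BY p.id
""").fetchall()
conn.close()

print(rows)  # [('Ann', 150.0, 150.0), ('Bob', None, 0)]
```

Here "no revenue" really does mean zero revenue, so treating NULL as 0 does not change the meaning of the data.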



Answer 3:

The main reason is that the database treats nulls differently from blanks or zeros, even though they look like blanks or zeros to the human eye.

Here is a link to an old design tip by Ralph Kimball on the same topic.

This blogpost talks about avoiding nulls in measures and gives a couple of suggestions.



Answer 4:

NULL instead of 0 should be used if you intend to compute an average on your fact column. This is the only time I believe NULLs are OK in a DWH fact or dimension.

If a fact value is unknown or late-arriving, then leaving it as NULL is best.

Aggregate functions such as MIN and MAX handle NULLs by simply ignoring them.

(For the record, one of Ralph Kimball's sidekicks said this in his course, which I attended.)

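-- NULL for the unknown value: sumx = 5, minx = 1, maxx = 4, avgx = 2.5
with goodf as
(
select 1 x
union all
select null
union all
select 4
)
select sum(x) sumx, min(x) minx, max(x) maxx, avg(cast(x as float)) avgx
from goodf;

-- 0 for the unknown value: sumx = 5, minx = 0, maxx = 4, avgx = 1.67 (approx.)
with badf as
(
select 1 x
union all
select 0 /* unknown */
union all
select 4
)
select sum(x) sumx, min(x) minx, max(x) maxx, avg(cast(x as float)) avgx
from badf;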

In badf above, the average comes out wrong because the zero standing in for the unknown value is treated as a literal 0. The minimum is affected too: it is reported as 0 instead of 1.