What conditions cause CHECKSUM_AGG to return 0?

2019-08-05 06:02发布

问题:

It seems that there are a number of conditions that cause CHECKSUM_AGG to return 0 that I wouldn't have expected. I've only been able to find one discussed, which is that duplicate values will cause it. That can be solved via a DISTINCT or GROUP BY.

I've also found a couple more scenarios that make less sense to me. One was provided by my boss and one I found MSDN. These I don't know how to explain. Here is some SQL that demonstrates the scenarios:

SELECT
    CHECKSUM_AGG(T.Number) AS ChecksumAgggregate
FROM
(
    VALUES
        (2)
        , (3)
        , (4)
        , (5)
)AS T(Number)

DECLARE @t TABLE 
(
    Category VARCHAR(15),
    Value VARCHAR(10)
)

INSERT @t 
(
    Category
    , Value
)
VALUES 
    ('OneCharacter','a')
    ,('OneCharacter','b')
    ,('OneCharacter','c')
    ,('OneCharacter','d')

    ,('TwoCharacters','aa')
    ,('TwoCharacters','bb')
    ,('TwoCharacters','cc')
    ,('TwoCharacters','dd')

    ,('ThreeCharacters','aaa')
    ,('ThreeCharacters','bbb')
    ,('ThreeCharacters','ccc')
    ,('ThreeCharacters','ddd')

    ,('SixCharacters','aaaaaa')
    ,('SixCharacters','bbbbbb')
    ,('SixCharacters','cccccc')
    ,('SixCharacters','dddddd')

    ,('AllValues','a')
    ,('AllValues','b')
    ,('AllValues','c')
    ,('AllValues','d')
    ,('AllValues','aa')
    ,('AllValues','bb')
    ,('AllValues','cc')
    ,('AllValues','dd')
    ,('AllValues','aaa')
    ,('AllValues','bbb')
    ,('AllValues','ccc')
    ,('AllValues','ddd')
    ,('AllValues','aaaaaa')
    ,('AllValues','bbbbbb')
    ,('AllValues','cccccc')
    ,('AllValues','dddddd')

select 
    Category, CHECKSUM_AGG(CHECKSUM(Value)) 
from @t 
group by Category

select Category, Value, CHECKSUM(Value) ValueChecksum
from @t

It's nothing but 0's in these examples from CHECKSUM_AGG from these queries. The last query shows that none of the CHECKSUM values that are getting input into the CHECKSUM_AGG call are duplicated.

I'm hoping that whatever answer describes what causes CHECKSUM_AGG to return 0 will explain these situations as well.

回答1:

This can happen if, for example, there is an even number of rows, or if the values sum up to certain values (e.g. 14). These both yield 0:

SELECT CHECKSUM_AGG(N)
FROM (VALUES (1),(2),(3),(4),(4)) AS T(N);

SELECT CHECKSUM_AGG(N)
FROM (VALUES (2),(3),(4),(5)) AS T(N);

But these do not:

SELECT CHECKSUM_AGG(N)
FROM (VALUES (1),(2),(3),(4),(5)) AS T(N);

SELECT CHECKSUM_AGG(N)
FROM (VALUES (2),(3),(4),(6)) AS T(N);

14 may just be a coincidence (and it doesn't happen for that value across 6 rows), I mention it only because it's the only pattern I noticed here. I don't know that any of these cases are documented.

Also you are aware that CHECKSUM_AGG is officially documented as being not reliable, i.e. it may not accurately reflect a data change, right? The doc doesn't go into details on this, it just states that sometimes it won't tell you there's been a change even if there has. (The actual wording is "However, there is a small chance that the checksum will not change.")

Anyway Michael Swart seems to have discovered that CHECKSUM_AGG is just XOR (here is his proof):

CREATE TABLE #f(a FLOAT);
GO

INSERT #f VALUES (RAND());
GO 20

DECLARE @i INT = 0;
SELECT @i = @i ^ CHECKSUM(a) FROM #f;
SELECT @i, CHECKSUM_AGG(CHECKSUM(a)) FROM #f;
GO

DROP TABLE #f;
GO

As he goes on to explain, this can lead to misleading results when the range of values is very small, since the odds of generating a reliable checksum diminishes rapidly.