Why do these Datetime values return the same Checksum?

Published 2019-07-21 08:45

Question:

I'm trying to quickly determine whether two sets of schedules are identical, and also generate a key by which these unique schedules can be referenced. I originally attempted to use HASHBYTES, but quickly discovered that it only accepts up to 8000 bytes of input, and I have a significant number of datetimes which, when concatenated, are longer than that.
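For illustration, here's a toy sketch (not my real code or data) of the kind of call that runs into that limit:

DECLARE @Concatenated NVARCHAR(MAX) =
    REPLICATE(CAST('2013-01-20 01:00:00.000|' AS NVARCHAR(MAX)), 500)  --roughly 12,000 characters

--On SQL Server 2014 and earlier this fails because the input exceeds 8000 bytes;
--SQL Server 2016 removed that input-size limit
SELECT HASHBYTES('SHA1', @Concatenated)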

So I attempted to use Checksum and Checksum_Agg, since they seem to be designed for this sort of thing. I'm aware that Checksum has a higher chance of generating non-unique values, but the scope/context in which these values need to be compared to each other is so narrow that I thought I could get away with it.

Unfortunately, after just a little bit of testing, I found Checksum "collisions" in as few as 4 rows of datetime data! I thought this was a bit strange and discovered a pattern to the collisions.

Below is a sample script which demonstrates the problem:

DECLARE @Rows TABLE
(
    [GroupId] INT,
    [StartDate] DATETIME,
    [EndDate] DATETIME
)

--Group1
INSERT INTO @Rows VALUES (1, '2013-01-20 01:00:00.000', '2013-01-20 01:20:00.000')
INSERT INTO @Rows VALUES (1, '2013-01-20 01:20:00.000', '2013-01-20 01:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 01:40:00.000', '2013-01-20 02:00:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:00:00.000', '2013-01-20 02:20:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:20:00.000', '2013-01-20 02:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:40:00.000', '2013-01-20 03:00:00.000')

--Group2
INSERT INTO @Rows VALUES (2, '2013-01-21 01:00:00.000', '2013-01-21 01:20:00.000')
INSERT INTO @Rows VALUES (2, '2013-01-21 01:20:00.000', '2013-01-21 01:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 01:40:00.000', '2013-01-21 02:00:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:00:00.000', '2013-01-21 02:20:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:20:00.000', '2013-01-21 02:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:40:00.000', '2013-01-21 03:00:00.000')

SELECT [ChecksumAgg1] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 1
) G1

SELECT [ChecksumAgg2] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 2
) G2

The Results are:

ChecksumAgg1: 5681728

ChecksumAgg2: 5681728

The only difference between the two series of dates is that they are 1 day apart, yet they generate the same checksum! BUT only if there is an even number of rows. If you un-comment one INSERT from Group 1 and one from Group 2 you will get two different Checksums. But un-comment yet another pair and you'll get another match!


In the end I have two questions. I would love to learn a bit more about how this works and why this pattern produces such a predictable checksum value. Even more importantly, I would like to know whether there's a better way to create a "fingerprint" of a very large set of data. I understand that I cannot guarantee this hash to be globally unique, but I apparently need something better than Checksum.

One way I was able to sort of trick the Checksum calculation was to apply HASHBYTES to each Datetime before feeding it to the Checksum function. That way Checksum is fed values which look far more random than a set of dates with similar-looking differences. But will it be enough?
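Here's a rough sketch of what I mean, reusing the @Rows table from the script above (the SHA1 algorithm and the VARBINARY conversion are just what I happened to try, not necessarily the best choices):

SELECT [ChecksumAgg1] = CHECKSUM_AGG([CheckSum])
FROM
(
    --Hash each datetime first so CHECKSUM sees scrambled bytes instead of
    --two dates that differ by a regular interval
    SELECT [CheckSum] = CHECKSUM(
        HASHBYTES('SHA1', CONVERT(VARBINARY(8), [StartDate])),
        HASHBYTES('SHA1', CONVERT(VARBINARY(8), [EndDate])))
    FROM @Rows
    WHERE GroupId = 1
) G1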


Edit - Here's just a bit more context.

Basically, I have one system which holds a huge amount of schedule data and a separate system which is interested in these schedules at specific times. For example, multiple users may see a specific version of a portion of this complex schedule and want to attach some metadata (maybe their approval status, notes, or something). If some external source makes a change to any single datetime, then this link needs to be broken, because it isn't the same schedule anymore!

There are many different systems which can introduce changes to the core schedule data, which is why I'm having a hard time bubbling this concern up to the code level and somehow managing and "normalizing" it into entities that represent each snapshot. I would have to have hooks in a million places listening for changes and then cleaning up anything pointing to the schedule.

Answer 1:

From a comment on this page:

http://msdn.microsoft.com/en-us/library/ms188920.aspx

It appears that Checksum_Agg is built using XORs, and the thing about XORs is that they are easily cancelled out by including the same value twice. This explains why you've only noticed it with an even number of rows.
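As a toy illustration (the values are arbitrary, nothing to do with your actual checksums): shifting both members of a pair by the same amount can leave their XOR unchanged, which is exactly what shifting every date by one day did to your per-row checksums.

SELECT 5 ^ 9   AS [PairA],  -- 0101 XOR 1001 = 1100 (12)
       7 ^ 11  AS [PairB]   -- 0111 XOR 1011 = 1100 (12), same result from different inputs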

As long as you're aware of the XOR issue and pre-scramble what you feed it in a way that mixes up all the bits, you should be OK.



Answer 2:

Do you think all this checksum stuff - given what you also have to do to ensure uniqueness - is worth the trouble? Personally, I think you will get better performance (and less complexity) by just comparing the columns directly instead of trying to reduce the work to comparing a single value.

Also keep in mind that datetime values are just pairs of integers under the covers, so it is not really all that surprising that applying a checksum to the combination of the two values can produce the same result. For example:

SELECT CHECKSUM_AGG(x)
FROM
(
  SELECT CHECKSUM(1,2)
  UNION ALL 
  SELECT CHECKSUM(2,3)
) AS y(x);

SELECT CHECKSUM_AGG(x)
FROM
(
  SELECT CHECKSUM(2,2)
  UNION ALL 
  SELECT CHECKSUM(1,3)
) AS y(x);

Results:

----
49

----
49

So I suggest you just put an index on (StartDate, EndDate) and be done with it. You're trying to make something that's already pretty efficient more efficient, and I think you are accomplishing the opposite.

As for the key, just use an IDENTITY column or some other surrogate. I see no advantage to nesting CHECKSUM_AGG(CHECKSUM(HASHBYTES(col1),HASHBYTES(col2))) to simulate uniqueness...
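If it helps, something along these lines is all I have in mind (the table and index names below are made up):

CREATE TABLE dbo.Schedule
(
    ScheduleId INT IDENTITY(1,1) PRIMARY KEY,  --surrogate key instead of a computed "fingerprint"
    GroupId    INT      NOT NULL,
    StartDate  DATETIME NOT NULL,
    EndDate    DATETIME NOT NULL
)

--Comparing schedules is then an indexed comparison on the real columns
CREATE INDEX IX_Schedule_StartEnd ON dbo.Schedule (StartDate, EndDate)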

EDIT

Or, given the new requirement, just use a ROWVERSION column if you want to make sure the data is the same as the last time you read it. I don't see how tracking millions of checksum results is any different from tracking rowversion or other calculated values. You're working way too hard when there are already built-in features that do what you're trying to do...
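A minimal sketch of the rowversion idea, using the same made-up table as above:

--The rowversion value changes automatically whenever the row is modified, so a stale
--reference can be detected by comparing a stored version with the current one
ALTER TABLE dbo.Schedule ADD ScheduleVersion ROWVERSION

SELECT ScheduleId, StartDate, EndDate, ScheduleVersion
FROM dbo.Schedule
WHERE GroupId = 1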



Answer 3:

I faced this issue too. It appears when all the values in a column are the same; presumably that column is not taken into account when the checksum is calculated.