I'm trying to quickly determine whether two sets of schedules are identical, and also to generate a key that these unique schedules can be referenced by. I originally attempted to use HASHBYTES, but quickly discovered that it only accepts 8000 bytes of input, and I have a significant number of datetimes which, when concatenated, are far longer than that.
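For reference, my first attempt looked roughly like this (a simplified sketch; the @Schedule table, the SHA1 choice and the 121 conversion format are just stand-ins for what I actually have):

-- Simplified sketch of the original HASHBYTES attempt: concatenate every
-- datetime in a group into one string, then hash the whole string.
DECLARE @Schedule TABLE ( [GroupId] INT, [StartDate] DATETIME, [EndDate] DATETIME )
INSERT INTO @Schedule VALUES (1, '2013-01-20 01:00:00.000', '2013-01-20 01:20:00.000')
INSERT INTO @Schedule VALUES (1, '2013-01-20 01:20:00.000', '2013-01-20 01:40:00.000')
-- ...imagine thousands of rows per group here...

DECLARE @Concat VARCHAR(MAX) =
(
    SELECT CONVERT(VARCHAR(23), [StartDate], 121) + CONVERT(VARCHAR(23), [EndDate], 121)
    FROM @Schedule
    WHERE [GroupId] = 1
    ORDER BY [StartDate]
    FOR XML PATH('')
)

-- Fine for small groups, but HASHBYTES only accepts 8000 bytes of input here,
-- and the real concatenated string is far longer than that.
SELECT [Fingerprint] = HASHBYTES('SHA1', @Concat)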
So I attempted to use CHECKSUM and CHECKSUM_AGG, since they seem to be designed for this sort of thing. I'm aware that CHECKSUM has a higher chance of generating non-unique values, but the scope/context in which these need to be compared to each other is so narrow that I thought I could get away with it.
Unfortunately, after just a little bit of testing, I found CHECKSUM "collisions" with just 4 rows of datetime data! That seemed strange, and I discovered a pattern to the collisions.
Below is a sample script which demonstrates the problem:
DECLARE @Rows TABLE
(
    [GroupId] INT,
    [StartDate] DATETIME,
    [EndDate] DATETIME
)

--Group1
INSERT INTO @Rows VALUES (1, '2013-01-20 01:00:00.000', '2013-01-20 01:20:00.000')
INSERT INTO @Rows VALUES (1, '2013-01-20 01:20:00.000', '2013-01-20 01:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 01:40:00.000', '2013-01-20 02:00:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:00:00.000', '2013-01-20 02:20:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:20:00.000', '2013-01-20 02:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:40:00.000', '2013-01-20 03:00:00.000')

--Group2
INSERT INTO @Rows VALUES (2, '2013-01-21 01:00:00.000', '2013-01-21 01:20:00.000')
INSERT INTO @Rows VALUES (2, '2013-01-21 01:20:00.000', '2013-01-21 01:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 01:40:00.000', '2013-01-21 02:00:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:00:00.000', '2013-01-21 02:20:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:20:00.000', '2013-01-21 02:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:40:00.000', '2013-01-21 03:00:00.000')

SELECT [ChecksumAgg1] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 1
) G1

SELECT [ChecksumAgg2] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 2
) G2
The Results are:
ChecksumAgg1: 5681728
ChecksumAgg2: 5681728
The only difference between the two series of dates is that they are 1 day apart, yet they generate the same checksum! BUT only when there is an even number of rows. If you un-comment one INSERT from Group 1 and one from Group 2, you will get two different checksums. Un-comment yet another pair, and you'll get another match!
In the end I have two questions. First, I would love to learn a bit more about how this works and why this pattern produces such a predictable checksum value. Second, and more importantly, is there a better way to create a "fingerprint" of a very large set of data? I understand that I cannot guarantee the hash will be globally unique, but I apparently need something better than CHECKSUM.
One way I was able to sort of trick the CHECKSUM calculation was to run HASHBYTES on each datetime before feeding it to the CHECKSUM function. That way CHECKSUM is fed values which look far more random than a set of dates with similar-looking differences. But will it be enough?
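Concretely, I mean something along these lines, tacked onto the sample script above (the SHA1 algorithm and the 121 conversion format are arbitrary choices on my part):

-- Hash each row first, so CHECKSUM/CHECKSUM_AGG is fed "random looking" bytes
-- instead of two nearly identical datetimes per row.
SELECT [ChecksumAgg1] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM(HASHBYTES('SHA1',
               CONVERT(VARCHAR(23), [StartDate], 121) + CONVERT(VARCHAR(23), [EndDate], 121)))
    FROM @Rows
    WHERE GroupId = 1
) G1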
Edit - Here's just a bit more context.
Basically, I have one system which holds a huge amount of schedule data, and a separate system which is interested in those schedules at specific times. For example, multiple users may see a specific version of a portion of this complex schedule and want to attach some metadata to it (maybe their approval status, notes, or something). If some external source changes any single datetime, then this link needs to be broken, because it isn't the same schedule anymore!
Many different systems can introduce changes to the core schedule data, which is why I'm having a hard time bubbling this concern up to the code level and "normalizing" it into entities that represent each snapshot in some way. I would need hooks in a million places listening for changes and then cleaning up anything that points to the schedule.