
Identifying Differences Efficiently

Posted 2020-07-18 08:23

Question:

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.

The problem is that these vendors will send the full dump of their data and not just the updates. We have some applications where we need only send the updates (that is, the changed records only). What we do currently is to load the data into a staging table and then compare it against previous data. This is painfully slow as the data set is huge and we are occasionally missing SLAs.

Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.

Answer 1:

There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.

One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row
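
In code, those four cases reduce to a few dictionary lookups. The sketch below is a minimal Python illustration, assuming the existing data and the incoming dump have each already been boiled down to a mapping from primary key to row hash; the names used here are placeholders, not anything from the original question.

    # Minimal sketch: classify an incoming full dump against existing hashes.
    # Both arguments are assumed to be {primary_key: row_hash} mappings.
    def classify_deltas(existing: dict, incoming: dict):
        unchanged, updated, new = [], [], []
        for key, row_hash in incoming.items():
            if key not in existing:
                new.append(key)            # PK missing from existing data = new row
            elif existing[key] == row_hash:
                unchanged.append(key)      # PK match + hash match = unchanged row
            else:
                updated.append(key)        # PK match + hash mismatch = updated row
        deleted = [key for key in existing if key not in incoming]  # deleted rows
        return unchanged, updated, new, deleted

    # Toy usage:
    existing = {1: "a1b2", 2: "c3d4", 3: "e5f6"}
    incoming = {1: "a1b2", 2: "ffff", 4: "9090"}
    print(classify_deltas(existing, incoming))   # ([1], [2], [4], [3])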

How to hash varies by database product, but all of the major vendors provide some form of built-in hashing function.

The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.

It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.



Answer 2:

The standard solution to this is hash functions. For each row, you calculate an identifier plus a hash of its contents. You then compare hashes, and if the hashes are the same you assume the row is the same. This is imperfect - it is theoretically possible for different values to produce the same hash value - but in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from hash functions failing to work as promised.

Both rsync and git are examples of widely used software that use hashes in this way.

In general, calculating a hash before you put the data in the database is faster than performing a series of comparisons inside the database. Furthermore, it allows the processing to be spread across multiple machines rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
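
As a rough illustration of hashing before the data ever reaches the database, the Python sketch below reads a CSV dump and emits a (primary key, hash) pair per row. The file layout, the "id" key column, and the choice of SHA-256 are assumptions made for the example, not details from the question.

    # Hedged sketch: compute (primary_key, row_hash) pairs while reading a CSV
    # dump, before it is bulk-loaded into the staging table.
    import csv
    import hashlib

    def hash_csv_rows(path: str, key_column: str = "id"):
        """Yield (primary_key, hex_hash) for every row in a CSV dump."""
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Concatenate the non-key fields in a fixed column order, using
                # a separator that is unlikely to appear in the data itself.
                payload = "\x1f".join(
                    row[col] for col in sorted(row) if col != key_column
                )
                yield row[key_column], hashlib.sha256(payload.encode("utf-8")).hexdigest()

The (key, hash) pairs can then be loaded alongside the raw data, so the database only ever compares two short columns instead of every field, and the hashing work itself can be spread over as many machines as there are files.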

There are many hash functions you can use. Depending on your application, you might want a cryptographic hash, though you probably don't have to. More bits are better than fewer, but a 64-bit hash should be fine for the application you describe. After processing a trillion deltas you would still have less than a 1 in 10 million chance of having made an accidental mistake.
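
That figure holds up to a quick back-of-the-envelope check, under the assumption of a uniformly distributed 64-bit hash:

    # Expected number of missed changes: each changed row has roughly a 2**-64
    # chance of colliding with its old hash, so over a trillion deltas:
    comparisons = 10**12                      # a trillion deltas
    print(comparisons / 2**64)                # ~5.4e-08, well under 1 in 10 million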