I have 7 files that I'm generating MD5 hashes for. The hashes are used to ensure that a remote copy of the data store is identical to the local copy. Unfortunately, the link between these two copies of the data is mind-numbingly slow. Changes to the data are very rare, but I have a requirement that the data be synchronized at all times (or as soon as possible).

Rather than passing 7 different MD5 hashes across my (extremely slow) communications link, I'd like to generate the hash for each file and then combine these hashes into a single hash which I can then transfer and then re-calculate/use for comparison on the remote side. If the "combined hash" differs, then I'd start sending the 7 individual hashes to determine exactly which file(s) have changed.

For example, here are the MD5 hashes for the 7 files as of last week:
0709d609d69385255c496436eb50402c
709465a74411bd596595c7b9b158ae6a
4ab657320ef33e3d5eb498e4c13d41b7
3b49c6ab199994fd776bb63761414e72
0fc28c5a010fc3c06c0c930c88e31a15
c4ecd214662cac5aae0e53f6f252bf0e
8b086431e43148a2c2d943ba30d31cc6
I'd like to combine these hashes together such that I get a single unique value (perhaps another MD5 hash?) that I can then send to the remote system. On the remote system, I'd then perform the same calculation to determine if the data as a whole has changed. If it has, then I'd start sending the individual hashes, etc.

The most important factor is that my "combined hash" be short enough so that it uses less bandwidth than just sending all 7 hashes in the first place. I thought of writing the 7 MD5 hashes to a file and then hashing that file, but is there a better way?
Why don't you:
- Generate the 7 MD5 hashes (which is what you are doing now), and then
- Combine these 7 hash outputs into a larger byte array and MD5 hash that to produce an overall hash. (Each MD5 hash is 16 bytes, so you will end up with a 112-byte array, which you then hash to get the overall hash.)
If your overall hash matches with the other end, then nothing needs to be done. If not, then you start to send over your intermediate 7 hashes to work out which file(s) have changed.
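A minimal Python sketch of that approach, assuming `hashlib` and placeholder file names (swap in your real seven paths):

```python
import hashlib

# Placeholder paths -- substitute the seven real files.
FILES = ["file1.dat", "file2.dat", "file3.dat", "file4.dat",
         "file5.dat", "file6.dat", "file7.dat"]

def overall_hash(paths):
    """MD5 of the concatenated per-file MD5 digests (7 x 16 = 112 bytes)."""
    combined = bytearray()
    for path in paths:
        with open(path, "rb") as f:
            combined += hashlib.md5(f.read()).digest()
    return hashlib.md5(bytes(combined)).hexdigest()
```

Run the same function on both ends and compare the two 32-character results; only on a mismatch do you fall back to exchanging the seven individual hashes.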
You could just calculate a hash of the contents of all seven files concatenated together.
However, I don't recommend that, because you will open yourself up to subtle bugs, like:
file1: 01 02 03 04, file2: 05 06 07 08

will hash the same as

file1: 01 02, file2: 03 04 05 06 07 08
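A quick way to see the problem (Python, using the two hypothetical layouts above): plain concatenation erases the file boundaries, so both layouts hash to the same value.

```python
import hashlib

# The same 8 bytes, split differently across two "files".
split_a = [bytes([0x01, 0x02, 0x03, 0x04]), bytes([0x05, 0x06, 0x07, 0x08])]
split_b = [bytes([0x01, 0x02]), bytes([0x03, 0x04, 0x05, 0x06, 0x07, 0x08])]

# Concatenating first makes the two layouts indistinguishable.
assert hashlib.md5(b"".join(split_a)).digest() == hashlib.md5(b"".join(split_b)).digest()
```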
How slow is your comm link? A single MD5 hash is 16 bytes (32 characters as hex).
7 of them is less than 1/4 KB even as hex; that's just not much data.
On what side of the link are the files going to change?
You could cache a set of MD5s on that side, compare the files to the cached hashes on a regular basis, and kick off a transfer when you notice a difference.
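A rough sketch of that caching idea in Python (the file names and the in-memory cache are placeholders; persist the cache however suits you):

```python
import hashlib

FILES = ["file1.dat", "file2.dat"]   # placeholder paths
cached = {}                          # path -> last known MD5 hex digest

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def changed_files():
    """Return the files whose current MD5 differs from the cached one."""
    changed = []
    for path in FILES:
        digest = md5_of(path)
        if cached.get(path) != digest:
            changed.append(path)
            cached[path] = digest
    return changed

# Call changed_files() on a schedule and start a transfer whenever it is non-empty.
```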
Another option is to generate a single hash in the first place - see https://stackoverflow.com/a/15683147/188926
This example iterates all files in a folder, but you could iterate over your list of files instead.
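For reference, here is a Python sketch of that single-hash idea (not the linked answer's code verbatim): feed every file into one running MD5 in a fixed order so both sides compute the same digest.

```python
import hashlib

def single_hash(paths):
    """One MD5 digest covering the contents of all the files."""
    md5 = hashlib.md5()
    for path in sorted(paths):                    # fixed order on both sides
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                md5.update(chunk)
    return md5.hexdigest()
```

Note that, like any plain concatenation, this does not encode file boundaries; mixing in each file's name or length before its contents would avoid that.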
XOR them all.
As far as I know, it's the simplest and most effective solution.
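A minimal sketch of XOR-combining the per-file digests (Python; the paths are placeholders). XOR is order-independent, so both sides get the same result regardless of file ordering.

```python
import hashlib

def xor_combined(paths):
    """XOR the 16-byte MD5 digests of all files into one 16-byte value."""
    combined = bytes(16)
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).digest()
        combined = bytes(a ^ b for a, b in zip(combined, digest))
    return combined.hex()
```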
I know this is out of left field, but you could simply check the Archive attribute on all of the files; if any of the files has this flag set, then the file has changed in some way.
You can then proceed to create a hash; if not, don't even bother generating a hash in the first place.
If the archive attribute is set, generate a hash, sync the files, and unset the archive attribute.
That would be my suggested solution.
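If you're on Windows, here is a rough Python sketch of that archive-bit check (the paths are placeholders; `st_file_attributes` needs Python 3.5+ on Windows, and clearing the bit goes through the Win32 `SetFileAttributesW` call):

```python
import ctypes
import os
import stat

FILES = ["file1.dat", "file2.dat"]   # placeholder paths

def archive_bit_set(path):
    """True if Windows has flagged the file as changed since the last backup/sync."""
    return bool(os.stat(path).st_file_attributes & stat.FILE_ATTRIBUTE_ARCHIVE)

def clear_archive_bit(path):
    """Clear the Archive attribute once the file has been synced."""
    attrs = os.stat(path).st_file_attributes
    ctypes.windll.kernel32.SetFileAttributesW(path, attrs & ~stat.FILE_ATTRIBUTE_ARCHIVE)

changed = [p for p in FILES if archive_bit_set(p)]
if changed:
    # Only now is it worth hashing/syncing; reset the flag afterwards.
    for path in changed:
        clear_archive_bit(path)
```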