Tracking unique versions of files with hashes

Posted 2019-07-06 13:01

I'm going to be tracking different versions of potentially millions of different files, and my intent is to hash them to determine whether I've already seen that particular version of the file. Currently I'm only using MD5 (the product is still in development, so it has never dealt with millions of files yet), which is clearly not long enough to avoid collisions.

However, here's my question: am I more likely to avoid collisions if I hash the file using two different methods and store both hashes (say, SHA1 and MD5), or if I pick a single, longer hash (like SHA256) and rely on that alone? I know option 1 has 288 hash bits and option 2 has only 256, but assume my two choices are the same total hash length.

Since I'm dealing with potentially millions of files (and multiple versions of those files over time), I'd like to do what I can to avoid collisions. However, CPU time isn't (completely) free, so I'm interested in how the community feels about the tradeoff - is adding more bits to my hash proportionally more expensive to compute, and are there any advantages to multiple different hashes as opposed to a single, longer hash, given an equal number of bits in both solutions?
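For concreteness, here is a minimal sketch of the two options using Python's hashlib; the function names and the 1 MiB read size are just illustrative, and both variants stream the file so it never has to fit in memory.

    import hashlib

    CHUNK = 1 << 20  # read 1 MiB at a time so large files are streamed, not loaded whole

    def dual_hash(path):
        # Option 1: two independent digests (MD5 + SHA1), 128 + 160 = 288 bits total
        md5, sha1 = hashlib.md5(), hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                md5.update(chunk)
                sha1.update(chunk)
        return md5.hexdigest() + sha1.hexdigest()

    def single_hash(path):
        # Option 2: one longer digest (SHA256), 256 bits
        sha256 = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                sha256.update(chunk)
        return sha256.hexdigest()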

2 Answers
The star · 2019-07-06 13:42

For file version tracking, I would think collisions between different files are not a concern. For each file, you are using a hash to determine whether that file, and only that file, changed. Whether the hash for that file collides with the hash of a different file is irrelevant, isn't it?

EDIT: You are applying a hash as an optimization to avoid comparing each new file to millions of existing files. Collisions are not a reason to avoid using a fast hash. Simply deal with the collision case (if it ever happens) by storing the new version of the file anyway. Both hashing schemes will provide that optimization, so why over-optimize for something that probably won't happen? Suppose you had a super-fast hash that collided once in 1,000,000: that would not be good for cryptography, but it would be fine for version control.

Even when using GUIDs, systems detect collisions and handle them. The system need not be optimized for something that will statistically never happen.
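As a minimal sketch of that idea (the `seen` mapping and the `store_version` callback here are hypothetical): use the digest purely as a fast index, and settle the rare same-digest case with a byte-for-byte comparison.

    import filecmp

    def record_version(path, digest, seen, store_version):
        # seen: dict mapping digest -> list of already-stored file paths
        # store_version: callback that copies the file into storage and returns its path
        for existing in seen.get(digest, []):
            if filecmp.cmp(existing, path, shallow=False):
                return existing  # identical content already stored, nothing to do
        # new content -- or, vanishingly unlikely, a true collision; either way,
        # store the file and keep the digest as an index, not as proof of identity
        stored = store_version(path)
        seen.setdefault(digest, []).append(stored)
        return stored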

一纸荒年 Trace · 2019-07-06 13:42

I've thought about and toyed with this problem a great deal, and I recommend using SHA256 to stay on the safe side (it's slower, but the CPU should still manage to keep up). I don't know whether this significantly weakens the overall strength, but you may want to split the file into 16 MB blocks (for example), hash each block, and then hash the concatenated block hashes at the end so you can parallelize the work.
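A sketch of that block-wise idea, assuming Python's hashlib and a thread pool (the block size and worker count are arbitrary choices). Note that the result is a hash of per-block hashes, so it only matches other values computed the same way, not a plain SHA256 of the whole file.

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB blocks, as suggested above

    def _read_blocks(path):
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                yield block

    def blockwise_sha256(path, workers=4):
        # hashlib releases the GIL for large buffers, so threads genuinely overlap here;
        # map() keeps the block order, which the outer hash depends on. For very large
        # files you would want to bound how many blocks are read ahead, since map()
        # submits its whole input eagerly.
        outer = hashlib.sha256()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for digest in pool.map(lambda b: hashlib.sha256(b).digest(), _read_blocks(path)):
                outer.update(digest)
        return outer.hexdigest()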

One lesson I've learned from toying with massive numbers of files and hashing is this: adding millions of records to a PostgreSQL database in one sitting isn't very fast. When I wrote a program to hash one million files and store them in a PostgreSQL database, the database was often the bottleneck. I didn't try MySQL, but I suspect it's roughly the same. SQLite is probably much faster since there's no client/server overhead, so I recommend trying SQLite first; even that might turn out to be too slow.
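For what it's worth, here is a sketch of the SQLite route using Python's built-in sqlite3 module (the table name and schema are made up); the main point is batching all the inserts into a single transaction instead of committing row by row.

    import sqlite3

    def store_hashes(db_path, rows):
        # rows: iterable of (file_path, hex_digest) tuples
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS file_versions (
                    path   TEXT NOT NULL,
                    digest TEXT NOT NULL,
                    UNIQUE (path, digest)
                )
            """)
            with conn:  # one transaction for the whole batch
                conn.executemany(
                    "INSERT OR IGNORE INTO file_versions (path, digest) VALUES (?, ?)",
                    rows,
                )
        finally:
            conn.close()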

Also, if you store a million files by hash in a directory and lose the index file, it's kind of hard to find things :)
