Question:

A lot of files will be stored in the DB, and I need file hashes to verify that a file has not been changed. (In general, this will be used as part of a Windows Personal Firewall.)

Answer 1:
This is, of course, not possible in general: no fixed-size hash can uniquely identify arbitrary files. Many people still use hashing for this purpose anyway, and MD5 is a popular algorithm that gives you a 128-bit "signature" for the file with a high probability of changing when the contents of the file change.
In the general case, you need to look at every bit of the file to include it in the hash, so performance will probably be I/O-limited. It's a sequential sweep over all data in the file, updating the state of whatever hash algorithm you use for each new block. On a modern CPU, the hash update will be quicker than the disk read. This rather old analysis shows around 45 MB/s on a 90 MHz Pentium CPU.
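The sequential sweep described above amounts to a chunked read feeding an incremental hash object. A minimal Python sketch (the 1 MiB chunk size is an arbitrary choice, and SHA-256 is used here rather than MD5 for the reasons given in the other answer):

```python
import hashlib

def hash_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in fixed-size chunks.

    Reading in chunks keeps memory use constant regardless of
    file size; the per-chunk hash update is usually faster than
    the disk read feeding it, so the loop is I/O-bound.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```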
Answer 2:
If I understand the "used as the Windows Personal Firewall" part right, MD5 is not a good choice of algorithm.
There exists a successful attack on the MD5 algorithm which lets you find a different message that produces the same hash with relatively little work (compared to brute force). That attack originally had little practical bearing on uses such as password hashing. In the meantime, however, new attacks have been found, so both MD5 and SHA-1 can be collided at scary speeds, and cracking entire databases of properly salted, single-hashed user passwords with these elderly hashes is not only entirely feasible but has already been demonstrated.
However, in the particular application of "make sure this file has not been tampered with", this kind of attack has always been an issue, not just recently. MD5 will quite safely detect a bit error or an accidental modification, but malware trying to bypass your personal firewall could rather trivially circumvent your entire security by finding a collision for the infected binary so the hash matches the original.
You should use SHA-256 for this case. [Update: in the meantime, SHA-3 is out, and while I personally don't agree with NIST's choice of a winner (or the obscure criteria for ruling out some very good round-2 candidates), it is a much safer choice to use SHA-3 (Keccak) or alternatively one of the SHA-3 finalists. All of the finalists were carefully designed by experienced teams and have been very thoroughly analyzed; so far none has a realistic attack or a known issue that could conceivably lead to one, and they all have "more bits", too (which by itself doesn't mean much, but more bits don't hurt).]
Also, remember to always save the file's length in addition to a hash; this considerably hardens even a poor hash at negligible cost. If you can, calculate two different hashes. It is much easier for an attacker to find some message that produces a collision on one hash than to find a message that produces a collision and has exactly the same length, or even one that collides on two different hashes and has the same length.
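The "length plus two different hashes" idea could be sketched like this (the record layout and function name are my own invention; SHA-256 and SHA3-256 are one reasonable pairing, since they are built on entirely different internal structures):

```python
import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """Return (size, sha256_hex, sha3_256_hex) for a file.

    An attacker would need a single input that matches the
    original length AND collides on both hash functions at
    once, which is far harder than a single-hash collision.
    """
    h1 = hashlib.sha256()
    h2 = hashlib.sha3_256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            size += len(chunk)
            # Both hashes consume the same chunk while it is
            # still hot in cache, so the second update is cheap.
            h1.update(chunk)
            h2.update(chunk)
    return (size, h1.hexdigest(), h2.hexdigest())
```

Note that feeding both hashes from the same read pass also illustrates the bandwidth point below: the file is only swept once.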
Since bandwidth (both disk and memory) is a non-negligible factor in calculating a hash, it is even possible that calculating two hashes simultaneously runs at a speed comparable to calculating a single one.
I've observed such an effect when calculating a CRC and encrypting the same blocks with a block cipher afterwards. Whether or not the CRC was calculated made less than 1% difference in overall runtime, so it was basically a free operation.
If you think that you have a strong reason for not using a well-known standard hash (performance constraints?), you could build your own secure hash. Using the Merkle–Damgård construction (or, more recently, HAIFA), you can turn any secure block cipher into a secure hash function. For example, encrypt every input block with AES using a fixed key and XOR the output into the next block before encrypting that one too. The output after the last block is your hash value.
While "build your own" is usually not a good idea, there may indeed be valid reasons in this case, since AES is fast and supported in hardware on the most recent processors. On my machine, AES runs at roughly 130 MB/s; on an i7 (which has hardware support) it's reported on the internet at around 570 MB/s.
As for being I/O-limited, unwind is right: the disk can very well be the limiting factor, though it need not be. Memory mapping is your friend, especially in your particular case.
If you check files that apply for rights on the firewall, those will be executables that have already been loaded into RAM (how could it be any different? They're being executed, after all!). So mapping pages that are already in RAM amounts to little more than adding page table entries, more or less a no-op. And even if the data is not in RAM, the performance (and ease) of memory mapping is outright stunning; I rarely use anything else these days when speed is of any concern.
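The memory-mapped variant is only a few lines. A Python sketch (the stdlib `mmap` module hides the platform differences, so the same code works on Windows and POSIX; note that mapping a zero-length file raises an error):

```python
import hashlib
import mmap

def hash_file_mmap(path):
    """Hash a non-empty file through a read-only memory mapping.

    If the executable's pages are already resident (e.g. it is
    currently running), mapping them costs little more than the
    page table entries; the hash then reads straight from the
    page cache with no explicit read() loop or copy buffer.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Length 0 maps the entire file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            h.update(m)
    return h.hexdigest()
```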