-->

How to detect whether two files are identical in P

Posted 2020-07-09 07:44

Question:

Is it enough in this case to make system calls to "md5sum file1" and "md5sum file2" and compare the two results?

Answer 1:

Well, that will tell you whether they're definitely different or probably the same. It's possible for two files to have the same hash but not actually have the same data... just very unlikely.

In your situation, what is the impact if you get a false positive (i.e. if you think they're the same, but they're not)? MD5 is probably good enough not to worry about collisions if they would only occur accidentally... but if you've got security (or money) at stake and someone could plant a "bad" file with the same hash as a "good" file, you shouldn't rely on it.

Personally, I'd probably just read both files and compare each byte. For a one-off comparison, both hashing and this approach have to read the whole file when the files are equal, but as Daniel points out in the comments, a byte-by-byte comparison lets you exit early as soon as you see a difference. Comparing the file sizes first is another quick optimization :)
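
The approach described above (size check first, then a byte-by-byte comparison with early exit) could be sketched roughly as follows. The function name files_identical and the 64 KiB chunk size are illustrative choices, not something from the original answer, and the sketch assumes both paths point at regular files.

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files hold exactly the same bytes."""
    # Files of different sizes can never match, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # exit early at the first differing chunk
            if not chunk_a:
                return True   # both files ended without a difference
```

Reading in chunks keeps memory use flat for large files while still stopping at the first mismatch.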

The general advantage of hashing occurs when you store the hash of the existing file somewhere, so that next time you can just read the new file.



Answer 2:

If you want to do more than just detect whether they differ, or you don't trust the hashing solution, there are standard library modules called difflib and filecmp that don't rely on external programs.
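
As a rough illustration of how the two modules can be combined, the sketch below uses filecmp to decide whether the files differ and difflib to show where. The helper name report_difference is hypothetical, and the code assumes both files are UTF-8 text; binary files would need a different treatment.

```python
import difflib
import filecmp

def report_difference(path_a, path_b):
    """Return unified-diff lines for two text files; empty list if identical."""
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    if filecmp.cmp(path_a, path_b, shallow=False):
        return []
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()
    return list(difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b))
```

Printing the result with "".join(...) gives output similar to what the diff -u command would show.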



Answer 3:

Of course, there is a simple test that you should do before comparing the file contents at all: if the files are different sizes, then they cannot possibly be the same.

Wouldn't it be more efficient to simply read each file and do a byte-by-byte comparison, avoiding the hashing algorithm altogether? This avoids the (very unlikely) chance that two different files produce the same MD5 hash. Furthermore, you can bail out of the comparison when the first difference is detected, which for very different files will be very early in the comparison (possibly on the first byte!).



Answer 4:

A hash is useful if you are going to cache it (to compare many different files with each other). If you just want to compare two files, it's a monstrous waste of cycles. After all, both files will be read in, and a lot of processing will be spent on every byte.

If it's a 1:1 compare, just use:

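import filecmp
# shallow=False compares the contents; the default may trust matching os.stat() data
filecmp.cmp(file_name_1, file_name_2, shallow=False)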

On the other hand, a good hash is the only practical way to compare a large number of files with each other.

SHA-1 and MD5 are sort of broken, but not for normal files. A few researchers can generate two nonsense files that collide, but it's unlikely that anyone can produce a collision with an existing file.

Git uses SHA-1 to identify content, so it's not a terrible choice.

The following will all work:

import hashlib
# hashlib constructors are lowercase; your_text_here must be bytes, not str
hash = hashlib.md5(your_text_here).hexdigest() # safe*
hash = hashlib.sha1(your_text_here).hexdigest() # safe*
hash = hashlib.sha224(your_text_here).hexdigest() # safe
hash = hashlib.sha512(your_text_here).hexdigest() # paranoid

# now put the hash in a dictionary (or database) for your many-to-many comparison.

#  * Meaningful files will not be clobbered. Contrived files can be generated
#    which might clash together, but it's difficult to do.
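
The dictionary idea mentioned in the comments above might look something like the sketch below. The function name group_identical is made up for illustration, SHA-256 is just one reasonable choice of hash, and each file is read fully into memory to keep the example short, which would not suit very large files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def group_identical(paths):
    """Group file paths whose contents share a SHA-256 digest."""
    groups = defaultdict(list)
    for path in paths:
        # Assumption: files are small enough to read in one go.
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(path)
    # Keep only digests seen more than once, i.e. likely duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Each file is hashed once, so comparing N files costs N reads rather than one read per pair.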


Answer 5:

If you're on a system with md5sum, that's probably good enough.

You can do it with the Python standard library: check out hashlib.
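
A minimal sketch of what that could look like, computing the same hex digest that md5sum prints, without spawning an external process. The name md5_of_file and the chunk size are illustrative assumptions.

```python
import hashlib

def md5_of_file(path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Two files can then be checked with md5_of_file(file1) == md5_of_file(file2), subject to the collision caveats discussed in the other answers.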



Answer 6:

It depends on whether you are comfortable with the probability of a collision in the MD5 algorithm. Just note that a collision is highly unlikely, so yes, go ahead.



Answer 7:

If there is nobody maliciously trying to create collisions, then you would have to compare about 2^64 files before you would expect to see a collision by random chance. However, it is possible for someone to carefully construct two files with the same MD5 sum due to cryptographic weaknesses in MD5. Whether the cryptographic weaknesses of MD5 matter or not depends on your application, where the files come from, and what an attacker could stand to gain by tricking your program into thinking two different files were identical. MD5 is still a very good checksum, just not so great as a cryptographic hash.
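
To get a feel for the accidental-collision figure above, the standard birthday approximation can be evaluated directly. This is only a back-of-the-envelope sketch: it assumes hash outputs behave like uniformly random 128-bit values, and the function name collision_probability is invented for the example.

```python
import math

def collision_probability(num_files, hash_bits=128):
    """Approximate chance that at least two of num_files random hashes collide."""
    # Birthday approximation: p ~ 1 - exp(-n * (n - 1) / 2^(bits + 1))
    exponent = -num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)
```

Under these assumptions, 2**64 files give a probability of roughly 0.39, while a million files give a value far too small to matter in practice.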



Answer 8:

Yes, it is enough.