I need to compare large chunks of data for equality, and I need to compare many per second, fast. Every object is guaranteed to be the same size, and it is possible/likely they may only be slightly different (in unknown positions).
I have seen, from the interactive session below, using ==
operator for byte strings can be slower if the differences are towards the end of the string, and it can be very fast if there is a difference near the start.
I thought there might be some way to speed things up using some sort of hash, of course computing the md5 hash and comparing is a fair whack slower, but python's inbuilt hash does seem to speed things up significantly.
However, I have no idea about the implementation details of this hash, is it really hash-like in that I can be comfortable that when hash(a) == hash(b)
then a == b
is very likely? I am happy to have a few incorrect results if a hash collision is reasonably rare (rare in the sense of needing an array of 200 PS3s several hours to make a collision)
In [1]: import hashlib
In [2]: with open('/dev/urandom') as f:
...: spam = f.read(2**20 - 1)
...:
In [3]: spamA = spam + 'A'
In [4]: Aspam = 'A' + spam
In [5]: spamB = spam + 'B'
In [6]: timeit spamA == spamB
1000 loops, best of 3: 1.59 ms per loop
In [7]: timeit spamA == Aspam
10000000 loops, best of 3: 66.4 ns per loop
In [8]: timeit hashlib.md5(spamA) == hashlib.md5(spamB)
100 loops, best of 3: 4.42 ms per loop
In [9]: timeit hashlib.md5(spamA) == hashlib.md5(Aspam)
100 loops, best of 3: 4.39 ms per loop
In [10]: timeit hash(spamA) == hash(spamB)
10000000 loops, best of 3: 157 ns per loop
In [11]: timeit hash(spamA) == hash(Aspam)
10000000 loops, best of 3: 160 ns per loop
Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographical). Also, the precise definition of
hash
is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it you want the same result on multiple machines.md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.
If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.
One more thing: To save space, you should store the data in binary form if you write it to disk. (i.e.
struct.pack('!q', hash('abc'))
/hashlib.md5('abc').digest()
).As a side note:
is
is not equivalent to==
in Python. You mean==
.