I need to compare large chunks of data for equality, and I need to do many comparisons per second, fast. Every object is guaranteed to be the same size, and it is likely that any two objects differ only slightly (in unknown positions).
As the interactive session below shows, the == operator on byte strings can be slow when the difference is towards the end of the string, and very fast when there is a difference near the start.
I thought there might be some way to speed things up using a hash. Computing the md5 hash and comparing is, of course, a fair whack slower, but Python's built-in hash does seem to speed things up significantly.
However, I have no idea about the implementation details of this hash: is it really hash-like, in the sense that I can be comfortable that when hash(a) == hash(b), then a == b is very likely? I am happy to accept a few incorrect results, as long as a hash collision is reasonably rare (rare in the sense that it would take an array of 200 PS3s several hours to produce one).
In [1]: import hashlib
In [2]: with open('/dev/urandom') as f:
...: spam = f.read(2**20 - 1)
...:
In [3]: spamA = spam + 'A'
In [4]: Aspam = 'A' + spam
In [5]: spamB = spam + 'B'
In [6]: timeit spamA == spamB
1000 loops, best of 3: 1.59 ms per loop
In [7]: timeit spamA == Aspam
10000000 loops, best of 3: 66.4 ns per loop
In [8]: timeit hashlib.md5(spamA).digest() == hashlib.md5(spamB).digest()
100 loops, best of 3: 4.42 ms per loop
In [9]: timeit hashlib.md5(spamA).digest() == hashlib.md5(Aspam).digest()
100 loops, best of 3: 4.39 ms per loop
In [10]: timeit hash(spamA) == hash(spamB)
10000000 loops, best of 3: 157 ns per loop
In [11]: timeit hash(spamA) == hash(Aspam)
10000000 loops, best of 3: 160 ns per loop
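
For concreteness, here is a minimal sketch of the scheme I have in mind (the Blob class and its names are mine, not from the session above): cache the built-in hash once per object, compare the cached hashes first, and only fall back to a full byte comparison when they agree.

class Blob(object):
    """Wraps one fixed-size chunk so repeated comparisons are cheap."""

    def __init__(self, data):
        self.data = data
        self._fingerprint = hash(data)  # computed once, reused for every comparison

    def __hash__(self):
        return self._fingerprint

    def __eq__(self, other):
        # Assumes other is also a Blob.
        # Fast reject: different fingerprints guarantee the data differs.
        if self._fingerprint != other._fingerprint:
            return False
        # Fingerprints match: do the full O(n) scan only in this case,
        # to rule out a collision.
        return self.data == other.data

    def __ne__(self, other):
        return not self.__eq__(other)

With this, two unequal chunks would almost always be rejected in constant time, and the expensive full scan would only run for chunks that are actually equal (or for the rare collision), assuming collisions in the built-in hash really are as rare as I hope.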