I want a hash function that is fast, collision-resistant, and gives unique output. The primary requirement is that it should be persistable, i.e. its progress (the hashing state) can be saved to a file and resumed later. You can also provide your own implementation in Python.
Implementations in other languages are also accepted, provided they can be used from Python without having to dig into their internals.
Thanks in advance :)
Because of the pigeonhole principle, no hash function can generate hashes that are unique / collision-proof: it maps an unlimited number of possible inputs onto a fixed number of outputs, so some inputs must share a hash. A good hash function is collision-resistant, and makes it difficult to generate a file that produces a specified hash. Designing a good hash function is an advanced topic, and I'm certainly no expert in that field. However, since my code is based on SHA-256 it should be fairly collision-resistant, and hopefully it's also difficult to generate a file that produces a specified hash, but I can make no guarantees in that regard.
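To make the pigeonhole argument concrete, here is a minimal sketch that truncates SHA-256 to a single byte. The helper names `tiny_hash` and `find_collision` are made up for this illustration and are not part of any library. With only 256 possible outputs, any 257 distinct inputs must contain at least one collision; full SHA-256 has the same property in principle, just with 2**256 possible outputs.

```python
import hashlib

def tiny_hash(data):
    """Hypothetical 8-bit hash: the first byte of the SHA-256 digest."""
    return bytearray(hashlib.sha256(data).digest())[0]

def find_collision(limit=257):
    """Hash distinct inputs until two of them share a tiny_hash value."""
    seen = {}  # maps hash value -> first input that produced it
    for i in range(limit):
        msg = str(i).encode('ascii')
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg
    return None  # cannot happen for limit >= 257: only 256 outputs exist

a, b, h = find_collision()
print("%r and %r both hash to %d" % (a, b, h))
```

In practice the collision turns up long before 257 inputs have been tried, which is why real hash functions need very large outputs.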
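Here's a resumable hash function based on SHA-256 which is fairly fast: it takes about 44 seconds to hash a 1.4GB file (roughly 32 MB/s) on my 2GHz machine with 2GB of RAM. Note that because the chunk hashes are combined into a running hash, the result is not the same as the plain SHA-256 digest of the file.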
persistent_hash.py
#! /usr/bin/env python
''' Use SHA-256 to make a resumable hash function

    The file is divided into fixed-size chunks, which are hashed separately.
    The hash of each chunk is combined into a hash for the whole file.

    The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
    When a signal is received, hashing continues until the end of the
    current chunk, then the file position and current hex digest are saved
    to a file. The name of this file is formed by appending '.hash' to the
    name of the file being hashed.

    Just re-run the program to resume hashing. The '.hash' file will be deleted
    once hashing is completed.

    Written by PM 2Ring 2014.11.11
'''
import sys
import os
import hashlib
import signal
quit = False
blocksize = 1<<16 # 64kB
blocksperchunk = 1<<10
chunksize = blocksize * blocksperchunk
def handler(signum, frame):
    # Only set a flag here; do_hash() checks it at the end of each chunk
    global quit
    print "\nGot signal %d, cleaning up." % signum
    quit = True
def do_hash(fname):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        # Resume: load the saved file position and running digest
        with open(hashname, 'rt') as f:
            data = f.read().split()
        pos = int(data[0])
        current = data[1].decode('hex')
    else:
        pos = 0
        current = ''

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            # Seed the running hash with the digest of everything so far
            full = hashlib.sha256(current)
            part = hashlib.sha256()
            for _ in xrange(blocksperchunk):
                block = f.read(blocksize)
                if block == '':
                    finished = True
                    break
                part.update(block)

            # Fold this chunk's digest into the running hash
            full.update(part.digest())
            current = full.digest()
            pos += chunksize
            print pos
            if finished or quit:
                break

    hexdigest = full.hexdigest()
    # Save state only if interrupted before the end of the file was reached;
    # a signal that arrives during the final chunk needs no '.hash' file
    interrupted = quit and not finished
    if interrupted:
        with open(hashname, 'wt') as f:
            f.write("%d %s\n" % (pos, hexdigest))
    elif os.path.exists(hashname):
        os.remove(hashname)

    return (not interrupted), pos, hexdigest
def main():
    if len(sys.argv) != 2:
        print "Calculate resumable hash of a file."
        print "Usage:\npython %s filename\n" % sys.argv[0]
        sys.exit(1)

    fname = sys.argv[1]

    # Let Control-C and SIGTERM request a clean stop instead of killing us
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    print do_hash(fname)


if __name__ == '__main__':
    main()
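The chaining idea can be checked in isolation with a simplified in-memory sketch. This is an illustration under stated assumptions, not a drop-in replacement: unlike persistent_hash.py it is written for Python 3, it works on a bytes object rather than a file, it handles the end of the data differently, and so its digests will not match those of the script above. The function name `hash_chunks` and its `max_chunks` parameter (used to simulate an interruption) are made up for this example.

```python
import hashlib

def hash_chunks(data, chunksize, pos=0, state=b'', max_chunks=None):
    """Fold chunks of `data` into a running SHA-256 state, starting at `pos`.

    Stops after `max_chunks` chunks (simulating an interruption) or at the
    end of the data. Returns (pos, state, finished).
    """
    done = 0
    while pos < len(data):
        if max_chunks is not None and done >= max_chunks:
            return pos, state, False
        # Hash the chunk on its own, then chain it onto the running state
        part = hashlib.sha256(data[pos:pos + chunksize]).digest()
        state = hashlib.sha256(state + part).digest()
        pos = min(pos + chunksize, len(data))
        done += 1
    return pos, state, True

data = bytes(range(256)) * 40  # 10240 bytes of sample input

# Uninterrupted run
_, full_state, _ = hash_chunks(data, 1000)

# Interrupted after 3 chunks; save "position hexdigest" as text,
# in the same layout the '.hash' file uses
saved_pos, saved_state, finished = hash_chunks(data, 1000, max_chunks=3)
line = "%d %s" % (saved_pos, saved_state.hex())

# Resume from the saved text
pos_text, hex_text = line.split()
_, resumed_state, _ = hash_chunks(data, 1000, pos=int(pos_text),
                                  state=bytes.fromhex(hex_text))

print(full_state == resumed_state)  # True
```

Because each step depends only on the previous 32-byte digest and the file position, those two values are all that has to be written to disk in order to resume.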