I want a hash function that is fast, collision-resistant, and gives unique output. The primary requirement is that it be persistable, i.e. its progress (the hashing state) can be saved to a file and resumed later. You may also provide your own implementation in Python.
Implementations in other languages are also accepted, provided they can be used from Python without having to get my hands dirty with the internals.
Thanks in advance :)
Because of the pigeonhole principle, no hash function can generate hashes that are unique / collision-proof: there are far more possible inputs than fixed-size outputs, so some inputs must share a hash. A good hash function is collision-resistant, and also makes it difficult to generate a file that produces a specified hash. Designing a good hash function is an advanced topic, and I'm certainly no expert in that field. However, since my code is based on sha256 it should be fairly collision-resistant, and hopefully it's also difficult to generate a file that produces a specified hash, but I can make no guarantees in that regard.
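To make the pigeonhole argument concrete, here is a small sketch (Python 3) that truncates SHA-256 to two bytes and then searches for two different inputs with the same truncated hash. The names `tiny_hash` and `find_collision` are made up for this illustration and are not part of hashlib. With only 65536 possible outputs, a collision has to turn up within 65537 inputs, and in practice it appears much sooner.

```python
import hashlib


def tiny_hash(data, nbytes=2):
    """Hypothetical helper: SHA-256 truncated to its first `nbytes` bytes."""
    return hashlib.sha256(data).digest()[:nbytes]


def find_collision(nbytes=2):
    """Return two distinct inputs that share the same truncated hash.

    There are 256 ** nbytes possible outputs, so trying one more input
    than that is guaranteed to produce a repeat (pigeonhole principle).
    """
    seen = {}
    for i in range(256 ** nbytes + 1):
        msg = str(i).encode("ascii")
        h = tiny_hash(msg, nbytes)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
    raise AssertionError("unreachable: a collision must exist")


a, b = find_collision()
print(a, b, tiny_hash(a) == tiny_hash(b))
```

The full 32-byte SHA-256 output is subject to the same argument; the output space is just so large that nobody is expected to stumble on a collision by searching like this.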
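It takes about 44 seconds to hash a 1.4GB file on my 2GHz machine with 2GB of RAM. Note that because the digests of the individual chunks are combined, the result is not the same as the plain SHA-256 digest of the file. Here's a resumable hash function based on sha256 which is fairly fast.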
#! /usr/bin/env python

''' Use SHA-256 to make a resumable hash function

    The file is divided into fixed-sized chunks, which are hashed separately.
    The hash of each chunk is combined into a hash for the whole file.

    The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
    When a signal is received, hashing continues until the end of the
    current chunk, then the file position and current hex digest are saved
    to a file. The name of this file is formed by appending '.hash' to the
    name of the file being hashed.

    Just re-run the program to resume hashing. The '.hash' file will be deleted
    once hashing is completed.

    Written by PM 2Ring 2014.11.11
'''

import sys
import os
import binascii
import hashlib
import signal

quit = False

blocksize = 1 << 16        # 64kB
blocksperchunk = 1 << 10   # 1024 blocks, so each chunk is 64MB

chunksize = blocksize * blocksperchunk

def handler(signum, frame):
    ''' Ask the hashing loop to stop at the end of the current chunk '''
    global quit
    print("\nGot signal %d, cleaning up." % signum)
    quit = True

def do_hash(fname):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        # Resume from the saved file position and digest
        with open(hashname, 'rt') as f:
            data = f.read().split()
        pos = int(data[0])
        current = binascii.unhexlify(data[1])
    else:
        pos = 0
        current = b''

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            full = hashlib.sha256(current)
            part = hashlib.sha256()
            for _ in range(blocksperchunk):
                block = f.read(blocksize)
                if not block:
                    finished = True
                    break
                part.update(block)

            # Combine the digest of this chunk with the running digest
            full.update(part.digest())
            current = full.digest()
            pos += chunksize
            print(pos)
            if finished or quit:
                break

    hexdigest = full.hexdigest()
    if quit and not finished:
        # Interrupted part-way: save the state so a later run can resume.
        # If the signal arrived during the last chunk the hash is already
        # complete, so nothing is saved in that case.
        with open(hashname, 'wt') as f:
            f.write("%d %s\n" % (pos, hexdigest))
    elif os.path.exists(hashname):
        os.remove(hashname)

    return finished, pos, hexdigest

def main():
    if len(sys.argv) != 2:
        print("Calculate resumable hash of a file.")
        print("Usage:\npython %s filename\n" % sys.argv[0])
        sys.exit(1)

    fname = sys.argv[1]

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    print(do_hash(fname))

if __name__ == '__main__':
    main()
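The chunk-combining idea described in the docstring can also be tried out in memory. The following is only a simplified sketch in Python 3, not a drop-in replacement for the program: `chain_step` and `chained_hash` are names invented for this example, the data is a byte string rather than a file, the chunk size is tiny, and the handling of the end of the input is simplified. It shows that stopping after a few chunks, saving the position and hex digest as text, and resuming from that saved state gives the same result as hashing everything in one go.

```python
import hashlib


def chain_step(current, chunk):
    """One link of the chain: SHA-256(current digest + SHA-256(chunk))."""
    full = hashlib.sha256(current)
    full.update(hashlib.sha256(chunk).digest())
    return full.digest()


def chained_hash(data, chunksize, pos=0, current=b""):
    """Hash data[pos:] chunk by chunk, starting from a saved state."""
    while pos < len(data):
        current = chain_step(current, data[pos:pos + chunksize])
        pos += chunksize
    return current


data = bytes(range(256)) * 40   # 10240 bytes of sample input
chunksize = 1024                # tiny chunks, for illustration only

# Hash everything in one go.
whole = chained_hash(data, chunksize)

# Hash the first 3 chunks, "save" the state as text, then resume from it.
pos = 3 * chunksize
saved = "%d %s" % (pos, chained_hash(data[:pos], chunksize).hex())
saved_pos, saved_hex = saved.split()
resumed = chained_hash(data, chunksize, int(saved_pos), bytes.fromhex(saved_hex))

print(resumed == whole)                        # True
print(whole == hashlib.sha256(data).digest())  # False: not plain SHA-256
```

Only the file position and a 32-byte digest need to be stored between runs, which is what makes the scheme easy to persist. The price is that the result depends on the chunk size and cannot be compared with the output of standard SHA-256 tools.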