Shortest hash in python to name cache files

2019-01-31 09:16发布

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).

So, if there is one collision we lose one cache file, but it is the collected savings of caching all objects makes the application startup much faster, so it does not matter much.

Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?

Edit: Use case is like this: The data loader gets a new instance of a data-carrying object. Unique types have unique repr. so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match I notice. So if we don't have cache or have a false match, I instead init obj (reloading its data).

Conclusions (?)

The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.

I found out how to take a hex hash (from any hash source) and store it compactly with base64:

# 'h' is a string of hex digits 
bytes = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
hashstr = base64.urlsafe_b64encode(bytes).rstrip("=")

标签: python hash
8条回答
兄弟一词,经得起流年.
2楼-- · 2019-01-31 10:00

We use hashlib.sha1.hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.

查看更多
我只想做你的唯一
3楼-- · 2019-01-31 10:04

If you do have a collision, how are you going to tell that it actually happened?

If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).

Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).

Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.

查看更多
登录 后发表回答