Shortest hash in python to name cache files-第2页回答

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).

So, if there is one collision we lose one cache file, but it is the collected savings of caching all objects makes the application startup much faster, so it does not matter much.

Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?

Edit: Use case is like this: The data loader gets a new instance of a data-carrying object. Unique types have unique repr. so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match I notice. So if we don't have cache or have a false match, I instead init obj (reloading its data).

Conclusions (?)

The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.

I found out how to take a hex hash (from any hash source) and store it compactly with base64:

# 'h' is a string of hex digits 
bytes = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
hashstr = base64.urlsafe_b64encode(bytes).rstrip("=")

标签： python hash

8条回答

兄弟一词,经得起流年.

2楼-- · 2019-01-31 10:00

We use hashlib.sha1.hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.

0人赞添加讨论(0) 举报

我只想做你的唯一

3楼-- · 2019-01-31 10:04

If you do have a collision, how are you going to tell that it actually happened?

If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).

Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).

Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.

0人赞添加讨论(0) 举报

上一页 1 2

Shortest hash in python to name cache files

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间