Question:
The program I am writing processes a large number of objects, each with its own unique id. The id is itself a long string with a complicated structure (a dozen of the object's fields joined by some separator).
Since I have to process a lot of these objects quickly, need to refer to them by id while processing, and have no power to change their format (I retrieve them externally, over the network), I want to map their complicated string ids to my own internal integer ids and use those for comparison, for transferring them to other processes, etc.
What I'm going to do is use a simple dict with the objects' string ids as keys and my internal integer ids as values.
My question is: is there a better way to do this in Python? Maybe there is a way to calculate some hash manually? Maybe a dict is not the best solution?
As for numbers: there are about 100K such unique objects in the system at a time, so the integer capacity is more than enough.
Answer 1:
For comparison purposes, you can intern the strings and then compare them with is instead of ==, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:
>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True
intern guarantees that id(intern(A)) == id(intern(B)) iff A == B. Be sure to intern any string as soon as it is input. Note that intern is called sys.intern in Python 3.x.
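As a minimal Python 3 sketch of the interning approach described above: the strings are built at runtime (the "field" sample data and the variable names are made up for illustration) so that they start out as separate objects, and sys.intern then maps both to one shared object.

```python
import sys

# Build the strings at runtime so each join call returns a new object.
a = "|".join(["field"] * 12)
b = "|".join(["field"] * 12)

equal_by_value = (a == b)        # same text
same_object_before = (a is b)    # two separate objects before interning

# Intern each string as soon as it arrives, then keep only the interned copy.
a = sys.intern(a)
b = sys.intern(b)
same_object_after = (a is b)     # both names now refer to the same object
```

The identity comparison is only meaningful inside one process, which is why the dict approach is still needed when ids cross process boundaries.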
But when you have to pass these strings to other processes, your dict solution seems best. What I usually do in such situations is:
str_to_id = {}
for s in strings:
    str_to_id.setdefault(s, len(str_to_id))
Regarding "so the integer capacity is more than enough": Python integers are bigints, so that should never be a problem.
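If the integer ids also need to be turned back into the original strings, the dict idea can be extended with a list for the reverse direction. The following is only a sketch; the class name IdRegistry, its method names, and the sample id strings are hypothetical and not part of the original answer.

```python
class IdRegistry:
    """Assigns consecutive integer ids to strings and maps them back."""

    def __init__(self):
        self._str_to_id = {}   # string id -> integer id
        self._id_to_str = []   # integer id -> string id (list index is the id)

    def get_id(self, s):
        # Return the existing id, or assign the next free integer.
        try:
            return self._str_to_id[s]
        except KeyError:
            new_id = len(self._id_to_str)
            self._str_to_id[s] = new_id
            self._id_to_str.append(s)
            return new_id

    def get_string(self, int_id):
        return self._id_to_str[int_id]


registry = IdRegistry()
first = registry.get_id("type=a;host=x;port=1")
second = registry.get_id("type=b;host=y;port=2")
again = registry.get_id("type=a;host=x;port=1")   # already known, same id
```

Because ids are handed out in arrival order, other processes can only interpret them if they share the same registry or receive the mapping alongside the ids.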
Answer 2:
How about the hash function?
In [130]: hash
Out[130]: <function hash>
In [131]: hash('foo')
Out[131]: -740391237
There is no need to store hashes (unless you want to): the point is that they are equal for objects that are value-equal (although the reverse may not be true - there are no doubt unequal strings or other objects that hash to the same value; such is the nature of hashing).
If you know the range of your keys (and you probably do), you could also use a perfect hash function generator. This is apparently one for Python: http://ilan.schnell-web.net/prog/perfect-hash/
Perfect hashes guarantee that keys within the specified range have a bijective relationship with their hash value.
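One caveat matters if the hash values are passed to other processes: in CPython 3.3 and later, hash() of a str is salted per interpreter process unless the PYTHONHASHSEED environment variable fixes the seed. The sketch below illustrates this by computing hash('foo') in child interpreters; the helper name hash_in_child is made up for the example.

```python
import os
import subprocess
import sys


def hash_in_child(text, seed):
    """Return hash(text) as computed by a fresh interpreter with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


# Same seed: the child processes agree on the hash value.
same_seed_agrees = hash_in_child("foo", 1) == hash_in_child("foo", 1)
# Different seeds: the values differ, as they would between two
# unrelated processes that use the default random seed.
different_seed_agrees = hash_in_child("foo", 1) == hash_in_child("foo", 2)
```

So the built-in hash is suitable as an in-process shortcut, but not as an id shared between processes unless every process runs with the same fixed seed.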
Answer 3:
You could use one of the hashlib algorithms to create a cryptographically sound digest of the long message, and then use this as dictionary keys. Example using SHA-256:
import hashlib
...
key = hashlib.sha256(longMessage).digest()
The chance of collisions is much smaller this way than by using hash(longMessage).
However, this could introduce a potentially big overhead. Unless memory usage is a big concern I would simply use the original strings as keys instead.
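In Python 3, hashlib works on bytes, so the string has to be encoded first. The sketch below also shows one way to turn the digest into an integer; the function name digest_id, the 8-byte truncation, and the sample ids are assumptions made for illustration.

```python
import hashlib


def digest_id(string_id, num_bytes=8):
    """Derive an integer from the SHA-256 digest of a string id."""
    digest = hashlib.sha256(string_id.encode("utf-8")).digest()
    # Truncating the digest keeps the integer small, at the cost of a
    # higher collision probability than the full 32 bytes.
    return int.from_bytes(digest[:num_bytes], "big")


x = digest_id("type=a;host=x;port=1")
y = digest_id("type=a;host=x;port=1")
z = digest_id("type=b;host=y;port=2")
```

Unlike the built-in hash, the digest depends only on the input bytes, so every process computes the same integer for the same string.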
Answer 4:
I've used the following for this purpose:
>>> from collections import defaultdict
>>> d = defaultdict(lambda: len(d))
>>> d["cats"]
0
>>> d["cars"]
1
>>> d["cats"]
0
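One thing to keep in mind with this pattern is that every d[key] lookup on an unknown key assigns a new id. A short sketch (the variable names are made up) of how to check for a key without assigning one:

```python
from collections import defaultdict

ids = defaultdict(lambda: len(ids))

cats = ids["cats"]   # first new key gets 0
cars = ids["cars"]   # next new key gets 1

# Membership tests and .get() do not call the default factory,
# so they leave the mapping unchanged.
known = "dogs" in ids
maybe = ids.get("dogs")
size_after_checks = len(ids)
```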
Answer 5:
If they are stored in memory and you're comparing each string as an object rather than as text, I would suggest using id(string) to get a unique integer. Alternatively, if you're storing them in a dict, you could use a defaultdict with a set of matches and hash them:
>>> from collections import defaultdict
>>> strings = 'a whole lot of strings which may share a hash'.split()
>>> storage = defaultdict(set)
>>> for s in strings:
...     storage[hash(s)].add(s)
...
>>> storage[hash('a')]
{'a'}
Exactly how you would implement this depends on how you're using them, but the basic idea should work. If you could post a specific example of what you're trying to do it might be easier to give a more detailed answer.
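To illustrate what "as an object rather than as text" means for the id(string) suggestion, here is a small sketch with made-up sample strings: two strings with identical text that were created separately have different id() values, so id() only works as a key when every part of the program holds the very same object.

```python
# Two strings with the same text, built separately at runtime.
a = "-".join(["alpha", "beta"])
b = "-".join(["alpha", "beta"])

same_text = (a == b)
same_id = (id(a) == id(b))               # separate objects, so the ids differ

alias = a
alias_id_matches = (id(alias) == id(a))  # another name for the same object
```

The value returned by id() is also only guaranteed to be unique while the object is alive, so it is not meaningful in another process.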
Answer 6:
dict is a fine solution. If you have a way of generating a unique ID based on the string ID, you could have that do double duty as the hash function for a custom string class:
class ID_String(str):
    cached_hash = None

    def __hash__(self):
        # custom hash code here
        return custom_hash

    def ID(self):
        # compute the hash once and reuse it as the integer ID
        if self.cached_hash is None:
            self.cached_hash = self.__hash__()
        return self.cached_hash
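A runnable variant of this idea might look like the sketch below. It assumes CRC-32 (zlib.crc32) as a stand-in for the "custom hash code", which the answer leaves open; the class name IDString and the sample id are likewise made up.

```python
import zlib


class IDString(str):
    """str subclass whose hash doubles as an integer ID."""

    _cached_id = None

    def ID(self):
        # Compute the ID once and reuse it on later calls.
        if self._cached_id is None:
            # Assumption: CRC-32 stands in for the custom hash code.
            self._cached_id = zlib.crc32(self.encode("utf-8"))
        return self._cached_id

    def __hash__(self):
        return self.ID()


key = IDString("type=a;host=x;port=1")
table = {key: "payload"}
found = table[IDString("type=a;host=x;port=1")]   # equal text, equal hash
```

CRC-32 is not unique for arbitrary inputs, so it only illustrates the mechanics; the ID is only as unique as the function behind it. Note also that a plain str with the same text hashes differently from an IDString, so dict lookups should consistently use one of the two types.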