The most effective way to assign unique integer id

2019-04-08 05:54发布

问题:

The program that I write processes a large number of objects, each with its own unique id, which itself is a string of complicated structure (dozen of unique fields of the object joined by some separator) and big length.

Since I have to process a lot of these objects fast and I need to reffer to them by id while processing and I have no power to change their format (I retrieve them externally, by network), I want to map their complicated string id to my own internal integer id and further use it for comparison, for transfering them further to other processes, etc.

What I'm going to do is to use a simple dict with keys as string id of the object and integer values as my internal integer id of it.

My question is: is there a better way in Python to do this? May be there is a way to calculate some hash manually, whatever? May be the dict is not the best solution?

As for numbers: there are about 100K of such unique objects in the system at a time, so the integer capacity is more than enough.

回答1:

For comparison purposes, you can intern the strings and then compare them with is instead of ==, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:

>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True

intern guarantees that id(intern(A)) == id(intern(B)) iff A == B. Be sure to intern any string as soon as it is input. Note that intern is called sys.intern in Python 3.x.

But when you have to pass these strings to other processes, your dict solution seems best. What I usually do in such situations is

str_to_id = {}
for s in strings:
    str_to_id.setdefault(s, len(str_to_id))

so the integer capacity is more than enough

Python integers are bigints, so that should never be a problem.



回答2:

How about the hash function?

In [130]: hash
Out[130]: <function hash>

In [131]: hash('foo')
Out[131]: -740391237

There is no need to store hashes (unless you want to): the point is that they are equal for objects that are value-equal (although the reverse may not be true - there are no doubt unequal strings or other objects that hash to the same value; such is the nature of hashing).

If you know the range of your keys (and you probably do), you could also use a perfect hash function generator. This is apparently one for python: http://ilan.schnell-web.net/prog/perfect-hash/

Perfect hashes guarantee that keys within the specified range have a bijective relationship with their hash value.



回答3:

You could use one of the hashlib algorithms to create a cryptographically sound digest of the long message, and then use this as dictionary keys. Example using SHA-256:

import hashlib
...
key = hashlib.sha256(longMessage).digest()

The chance of collisions is much smaller this way than by using hash(longMessage).

However, this could introduce a potentially big overhead. Unless memory usage is a big concern I would simply use the original strings as keys instead.



回答4:

I've used the following for this purpose:

>>> from collections import defaultdict
>>> d = defaultdict(lambda: len(d))
>>> d["cats"]
0
>>> d["cars"]
1
>>> d["cats"]
0


回答5:

If they are stored in memory, and you're comparing each string as an object rather than as text I would suggest using id(string) to get a unique integer. Alternatively, if you're storing them in a dict you could use a defaultdict with a set of matches and hash them:

>>> strings = 'a whole lot of strings which may share a hash'.split()
>>> storage = defaultdict(set)
>>> for s in strings:
...     storage[hash(s)].add(s)
>>> storage[hash('a')]
{'a', 'a'}

Exactly how you would implement this depends on how you're using them, but the basic idea should work. If you could post a specific example of what you're trying to do it might be easier to give a more detailed answer.



回答6:

dict is a fine solution. If you have a way of generating a unique ID based on the string ID, you could have that do double duty as hash function for a custom string class:

class ID_String(str):
    cached_hash = None
    def __hash__(self):
        # custom hash code here
        return custom_hash
    def ID(self):
        if self.cached_hash is None:
            self.cached_hash = self.__hash__()
        return self.cached_hash


标签: python hash