Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.
相关问题
- how to define constructor for Python's new Nam
- streaming md5sum of contents of a large remote tar
- How to get the background from multiple images by
- Evil ctypes hack in python
- Correctly parse PDF paragraphs with Python
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).
dict
uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).O(1)
lookup by index).The figure below is a logical representation of a Python hash table. In the figure below,
0, 1, ..., i, ...
on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
i
, that is based on the hash of the key. CPython initially usesi = hash(key) & mask
(wheremask = PyDictMINSIZE - 1
, but that's not really important). Just note that the initial slot,i
, that is checked depends on the hash of the key.<hash|key|value>
). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)==
comparison not theis
comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.i+1, i+2, ...
and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.dict
will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.
Here's the short course:
The ordered aspect is unofficial as of Python 3.6, but official in Python 3.7.
Python's Dictionaries are Hash Tables
For a long time, it worked exactly like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 index (like the example below.)
Each row takes up 24 bytes on a 64 bit architecture, 12 on a 32 bit. (Note that the column headers are just labels - they don't actually exist in memory.)
If the hash ended the same as a preexisting key's hash, this is a collision, and then it would stick the key-value pair in a different location.
After 5 key-values are stored, when adding another key-value pair, the probability of hash collisions is too large, so the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows.
This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is to compute the hash, go to the expected location, compare the key's id - if they're the same object, they're equal. If not then compare the hash values, if they are not the same, they're not equal. Else, then we finally compare keys for equality, and if they are equal, return the value. The final comparison for equality can be quite slow, but the earlier checks usually shortcut the final comparison, making the lookups very quick.
(Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the hash function such that it computes a different hash for each new Python process.)
The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new (if unofficial) feature that dictionaries are now ordered (by insertion).
The New Compact Hash Tables
We start, instead, by preallocating an array for the index of the insertion.
Since our first key-value pair goes in the second slot, we index like this:
And our table just gets populated by insertion order:
So when we do a lookup for a key, we use the hash to check the position we expect (in this case, we go straight to index 1 of the array), then go to that index in the hash-table (e.g. index 0), check that the keys are equal (using the same algorithm described earlier), and if so, return the value.
We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upside that we save quite a lot of space over the pre-existing implementation. The only space wasted are the null bytes in the index array.
Raymond Hettinger introduced this to python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion is still considered an implementation detail to allow other implementations of Python a chance to catch up.
Shared Keys
Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys' hashes. You can think of it like this:
For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.
Shared Keys for Custom Objects & Alternatives
These shared-key dicts are intended to be used for custom objects'
__dict__
. To get this behavior, I believe you need to finish populating your__dict__
before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the__init__
or__new__
, else you might not get your space savings.However, if you know all of your attributes at the time your
__init__
is executed, you could also provide__slots__
for your object, and guarantee that__dict__
is not created at all (if not available in parents), or even allow__dict__
but guarantee that your foreseen attributes are stored in slots anyways. For more on__slots__
, see my answer here.See also:
**kwargs
in a function.Python Dictionaries use Open addressing (reference inside Beautiful code)
NB! Open addressing, a.k.a closed hashing should, as noted in Wikipedia, not be confused with its opposite open hashing!
Open addressing means that the dict uses array slots, and when an object's primary position is taken in the dict, the object's spot is sought at a different index in the same array, using a "perturbation" scheme, where the object's hash value plays part.