I'm referring to the OrderedDict from the collections
module, which is an ordered dictionary.
If it has the added functionality of being orderable, which I realize may often not be necessary but even so, are there any downsides? Is it slower? Is it missing any functionality? I didn't see any missing methods.
In short, why shouldn't I always use this instead of a normal dictionary?
OrderedDict
is a subclass of dict
, and needs more memory to keep track of the order in which keys are added. This isn't trivial. The implementation adds a second dict
under the covers, and a doubly-linked list of all the keys (that's the part that remembers the order), and a bunch of weakref proxies. It's not a lot slower, but at least doubles the memory over using a plain dict
.
But if it's appropriate, use it! That's why it's there :-)
How it works
The base dict is just an ordinary dict mapping keys to values - it's not "ordered" at all. When a <key, value>
pair is added, the key
is appended to a list. The list is the part that remembers the order.
But if this were a Python list, deleting a key would take O(n)
time twice over: O(n)
time to find the key in the list, and O(n)
time to remove the key from the list.
So it's a doubly-linked list instead. That makes deleting a key constant (O(1)
) time. But we still need to find the doubly-linked list node belonging to the key. To make that operation O(1)
time too, a second - hidden - dict maps keys to nodes in the doubly-linked list.
So adding a new <key, value>
pair requires adding the pair to the base dict, creating a new doubly-linked list node to hold the key, appending that new node to the doubly-linked list, and mapping the key to that new node in the hidden dict. A bit over twice as much work, but still O(1)
(expected case) time overall.
Similarly, deleting a key that's present is also a bit over twice as much work but O(1)
expected time overall: use the hidden dict to find the key's doubly-linked list node, delete that node from the list, and remove the key from both dicts.
Etc. It's quite efficient.
multithreading
if your dictionary is accessed from multiple threads without a lock, especially as a synchronisation point.
vanilla dict operations are atomic, and any type extended in Python is not.
In fact, I'm not even certain OrderedDict is thread-safe (without a lock), although
I cannot discount the possibility that it was very carefully coded and satisfies definition of reentrancy.
lesser devils
memory usage if you create tons of these dictionaries
cpu usage if all your code does is munge these dictionaries
why shouldn't I always use this instead of a normal dictionary
In Python 2.7, normal OrderedDict
usage will create reference cycles. So any use of OrderedDict
requires the garbage collector to be enabled in order to free the memory. Yes, the garbage collector is on by default in cPython, but disabling it has its uses.
e.g. With cPython 2.7.14
from __future__ import print_function
import collections
import gc
if __name__ == '__main__':
d = collections.OrderedDict([('key', 'val')])
gc.collect()
del d
gc.set_debug(gc.DEBUG_LEAK)
gc.collect()
for i, obj in enumerate(gc.garbage):
print(i, obj)
outputs
gc: collectable <list 00000000033E7908>
gc: collectable <list 000000000331EC88>
0 [[[...], [...], 'key'], [[...], [...], 'key'], None]
1 [[[...], [...], None], [[...], [...], None], 'key']
Even if you just create an empty OrderedDict
(d = collections.OrderedDict()
) and don't add anything to it, or you explicitly try to clean it up by calling the clear
method (d.clear()
before del d
), you will still get one self-referencing list:
gc: collectable <list 0000000003ABBA08>
0 [[...], [...], None]
This seems to have been the case since this commit removed the __del__
method in order to prevent the potential for OrderedDict
to cause uncollectable cycles, which are arguably worse. As noted in the changelog for that commit:
Issue #9825: removed __del__ from the definition of collections.OrderedDict.
This prevents user-created self-referencing ordered dictionaries from
becoming permanently uncollectable GC garbage. The downside is that
removing __del__ means that the internal doubly-linked list has to wait for
GC collection rather than freeing memory immediately when the refcnt drops
to zero.
Note that in Python 3, the fix for the same issue was made differently and uses weakref proxies to avoid cycles:
Issue #9825: Using __del__ in the definition of collections.OrderedDict made
it possible for the user to create self-referencing ordered dictionaries
which become permanently uncollectable GC garbage. Reinstated the Py3.1
approach of using weakref proxies so that reference cycles never get created
in the first place.
Since Python 3.7, all dictionaries are guaranteed to be ordered. The Python contributors determined that switching to making dict
ordered would not have a negative performance impact. I don't know how the performance of OrderedDict
compares to dict
in Python >= 3.7, but I imagine they would be comparable since they are both ordered.
See also:
- Will OrderedDict become redundant in Python 3.7?