Are there any reasons not to use an OrderedDict?

2019-01-07 07:45发布

问题:

I'm referring to the OrderedDict from the collections module, which is an ordered dictionary.

If it has the added functionality of being orderable, which I realize may often not be necessary but even so, are there any downsides? Is it slower? Is it missing any functionality? I didn't see any missing methods.

In short, why shouldn't I always use this instead of a normal dictionary?

回答1:

OrderedDict is a subclass of dict, and needs more memory to keep track of the order in which keys are added. This isn't trivial. The implementation adds a second dict under the covers, and a doubly-linked list of all the keys (that's the part that remembers the order), and a bunch of weakref proxies. It's not a lot slower, but at least doubles the memory over using a plain dict.

But if it's appropriate, use it! That's why it's there :-)

How it works

The base dict is just an ordinary dict mapping keys to values - it's not "ordered" at all. When a <key, value> pair is added, the key is appended to a list. The list is the part that remembers the order.

But if this were a Python list, deleting a key would take O(n) time twice over: O(n) time to find the key in the list, and O(n) time to remove the key from the list.

So it's a doubly-linked list instead. That makes deleting a key constant (O(1)) time. But we still need to find the doubly-linked list node belonging to the key. To make that operation O(1) time too, a second - hidden - dict maps keys to nodes in the doubly-linked list.

So adding a new <key, value> pair requires adding the pair to the base dict, creating a new doubly-linked list node to hold the key, appending that new node to the doubly-linked list, and mapping the key to that new node in the hidden dict. A bit over twice as much work, but still O(1) (expected case) time overall.

Similarly, deleting a key that's present is also a bit over twice as much work but O(1) expected time overall: use the hidden dict to find the key's doubly-linked list node, delete that node from the list, and remove the key from both dicts.

Etc. It's quite efficient.



回答2:

multithreading

if your dictionary is accessed from multiple threads without a lock, especially as a synchronisation point.

vanilla dict operations are atomic, and any type extended in Python is not.

In fact, I'm not even certain OrderedDict is thread-safe (without a lock), although I cannot discount the possibility that it was very carefully coded and satisfies definition of reentrancy.

lesser devils

memory usage if you create tons of these dictionaries

cpu usage if all your code does is munge these dictionaries



回答3:

why shouldn't I always use this instead of a normal dictionary

In Python 2.7, normal OrderedDict usage will create reference cycles. So any use of OrderedDict requires the garbage collector to be enabled in order to free the memory. Yes, the garbage collector is on by default in cPython, but disabling it has its uses.

e.g. With cPython 2.7.14

from __future__ import print_function

import collections
import gc

if __name__ == '__main__':
    d = collections.OrderedDict([('key', 'val')])
    gc.collect()
    del d
    gc.set_debug(gc.DEBUG_LEAK)
    gc.collect()
    for i, obj in enumerate(gc.garbage):
        print(i, obj)

outputs

gc: collectable <list 00000000033E7908>
gc: collectable <list 000000000331EC88>
0 [[[...], [...], 'key'], [[...], [...], 'key'], None]
1 [[[...], [...], None], [[...], [...], None], 'key']

Even if you just create an empty OrderedDict (d = collections.OrderedDict()) and don't add anything to it, or you explicitly try to clean it up by calling the clear method (d.clear() before del d), you will still get one self-referencing list:

gc: collectable <list 0000000003ABBA08>
0 [[...], [...], None]

This seems to have been the case since this commit removed the __del__ method in order to prevent the potential for OrderedDict to cause uncollectable cycles, which are arguably worse. As noted in the changelog for that commit:

Issue #9825: removed __del__ from the definition of collections.OrderedDict. This prevents user-created self-referencing ordered dictionaries from becoming permanently uncollectable GC garbage. The downside is that removing __del__ means that the internal doubly-linked list has to wait for GC collection rather than freeing memory immediately when the refcnt drops to zero.


Note that in Python 3, the fix for the same issue was made differently and uses weakref proxies to avoid cycles:

Issue #9825: Using __del__ in the definition of collections.OrderedDict made it possible for the user to create self-referencing ordered dictionaries which become permanently uncollectable GC garbage. Reinstated the Py3.1 approach of using weakref proxies so that reference cycles never get created in the first place.



回答4:

Since Python 3.7, all dictionaries are guaranteed to be ordered. The Python contributors determined that switching to making dict ordered would not have a negative performance impact. I don't know how the performance of OrderedDict compares to dict in Python >= 3.7, but I imagine they would be comparable since they are both ordered.

See also:

  • Will OrderedDict become redundant in Python 3.7?