Python list.clear() time and space complexity?

2020-06-17 16:18发布

问题:

I am writing a blogpost on Python list.clear() method where I also want to mention about the time and space complexity of the underlying algorithm. I expected the time complexity to be O(N), iterate over the elements and free the memory? But, I found an article where it is mentioned that it is actually an O(1) operation. Then, I searched the source code of the method in CPython implementation and found a method which I believe is the actual internal implementation of list.clear(), however, I am not really sure it is. Here's the source code of the method:

static int
_list_clear(PyListObject *a)
{
    Py_ssize_t i;
    PyObject **item = a->ob_item;
    if (item != NULL) {
         /* Because XDECREF can recursively invoke operations on
           this list, we make it empty first. */
        i = Py_SIZE(a);
        Py_SIZE(a) = 0;
        a->ob_item = NULL;
        a->allocated = 0;
        while (--i >= 0) {
           Py_XDECREF(item[i]);
        }
        PyMem_FREE(item);
    }
    /* Never fails; the return value can be ignored.
       Note that there is no guarantee that the list is actually empty
       at this point, because XDECREF may have populated it again! */
    return 0;
}

I could be wrong but it does look like O(N) to me. Also, I found a similar question here, but there's no clear answer there. Just want to confirm the actual time and space complexity of list.clear(), and maybe a little explanation supporting the answer. Any help appreciated. Thanks.

回答1:

As you correctly noticed, the CPython implementation of list.clear is O(n). The code iterates over the elements in order to decrease the reference count of each one, without a way to avoid it. There is no doubt that it is an O(n) operation and, given a large enough list, you can measure the time spent in clear() as function of list size:

import time

for size in 1_000_000, 10_000_000, 100_000_000, 1_000_000_000:
    l = [None] * size
    t0 = time.time()
    l.clear()
    t1 = time.time()
    print(size, t1 - t0)

The output shows linear complexity; on my system with Python 3.7 it prints the following:

1000000 0.0023756027221679688
10000000 0.02452826499938965
100000000 0.23625731468200684
1000000000 2.31496524810791

The time per element is of course tiny because the loop is coded in C and each iteration does very little work. But, as the above measurement shows, even a tiny per-element factor eventually adds up. Small per-element constant is not the reason to ignore the cost of an operation, or the same would apply to the loop that shifts the list elements in l.insert(0, ...), which is also very efficient - and yet few would claim insertion at the beginning to be O(1). (And clear potentially does more work because a decref will run an arbitrary chain of destructors for an object whose reference count actually reaches zero.)

On a philosophical level, one could argue that costs of memory management should be ignored when assessing complexity because otherwise it would be impossible to analyze anything with certainty, as any operation could trigger a GC. This argument has merit; GC does come occasionally and unpredictably, and its cost can be considered amortized across all allocations. In a similar vein complexity analysis tends to ignore the complexity of malloc because the parameters it depends on (like memory fragmentation) are typically not directly related to allocation size or even to the number of already allocated blocks. However, in case of list.clear there is only one allocated block, no GC is triggered, and the code is still visiting each and every list element. Even with the assumption of O(1) malloc and amortized O(1) GC, list.clear still takes the time proportional to the number of elements in the list.

The article linked from the question is about Python the language and doesn't mention a particular implementation. Python implementations that don't use reference counting, such as Jython or PyPy, are likely to have true O(1) list.clear, and for them the claim from the article would be entirely correct. So, when explaining the Python list on a conceptual level, it is not wrong to say that clearing the list is O(1) - after all, all the object references are in a contiguous array, and you free it only once. This is the point your blog post probably should make, and that is what the linked article is trying to say. Taking the cost of reference counting into account too early might confuse your readers and give them completely wrong ideas about Python's lists (e.g. they could imagine that they are implemented as linked lists).

Finally, at some point one must accept that memory management strategy does change complexity of some operations. For example, destroying a linked list in C++ is O(n) from the perspective of the caller; discarding it in Java or Go would be O(1). And not in the trivial sense of a garbage-collected language is just postponing the same work for later - it is quite possible that a moving collector will only traverse reachable objects and will indeed never visit the elements of the discarded linked list. Reference counting makes discarding large containers algorithmically similar to manual collection, and GC can remove that. While CPython's list.clear has to touch every element to avoid a memory leak, it is quite possible that PyPy's garbage collector never needs to do anything of the sort, and thus has a true O(1) list.clear.



回答2:

It's O(1) neglecting memory management. It's not quite right to say it's O(N) accounting for memory management, because accounting for memory management is complicated.

Most of the time, for most purposes, we treat the costs of memory management separately from the costs of the operations that triggered it. Otherwise, just about everything you could possibly do becomes O(who even knows), because almost any operation could trigger a garbage collection pass or an expensive destructor or something. Heck, even in languages like C with "manual" memory management, there's no guarantee that any particular malloc or free call will be fast.

There's an argument to be made that refcounting operations should be treated differently. After all, list.clear explicitly performs a number of Py_XDECREF operations equal to the list's length, and even if no objects are deallocated or finalized as a result, the refcounting itself will necessarily take time proportional to the length of the list.

If you count the Py_XDECREF operations list.clear performs explicitly, but ignore any destructors or other code that might be triggered by the refcounting operations, and you assume PyMem_FREE is constant time, then list.clear is O(N), where N is the original length of the list. If you discount all memory management overhead, including the explicit Py_XDECREF operations, list.clear is O(1). If you count all memory management costs, then the runtime of list.clear cannot be asymptotically bounded by any function of the list's length.



回答3:

As the other answers have noted, it takes O(n) time to clear a list of length n. But I think there is an additional point to be made about amortized complexity here.

If you start with an empty list, and do N append or clear operations in any order, then the total running time across all of those operations is always O(N), giving an average per operation of O(1), however long the list gets in the process, and however many of those operations are clear.

Like clear, the worst case for append is also O(n) time where n is the length of the list. That's because when the capacity of the underlying array needs to be increased, we have to allocate a new array and copy everything across. But the cost of copying each element can be "charged" to one of the append operations which got the list to a length where the array needs to be resized, in such a way that N append operations starting from an empty list always take O(N) time.

Likewise, the cost of decrementing an element's refcount in the clear method can be "charged" to the append operation which inserted that element in the first place, because each element can only get cleared once. The conclusion is that if you are using a list as an internal data structure in your algorithm, and your algorithm repeatedly clears that list inside a loop, then for the purpose of analysing your algorithm's time complexity you should count clear on that list as an O(1) operation, just as you'd count append as an O(1) operation in the same circumstances.



回答4:

A quick time check indicates that it is O(n).

Let's execute the following and create the lists beforehand to avoid overhead:

import time
import random

list_1000000 = [random.randint(0,10) for i in range(1000000)]
list_10000000 = [random.randint(0,10) for i in range(10000000)]
list_100000000 = [random.randint(0,10) for i in range(100000000)]

Now check for the time it takes to clear these four lists of different sizes as follows:

start = time.time()
list.clear(my_list)
end = time.time()
print(end - start))

Results:

list.clear(list_1000000) takes 0.015
list.clear(list_10000000) takes 0.074
list.clear(list_100000000) takes 0.64

A more robust time measurement is needed, since these numbers can deviate each time it is ran, but the results indicate that the execution time goes pretty much linearly as the input size grows. Hence we can conclude an O(n) complexity.