Building a string through repeated string concatenation is an anti-pattern, but I'm still curious why its performance switches from linear to quadratic after string length exceeds approximately 10 ** 6:
# this will take time linear in n with the optimization
# and quadratic time without the optimization
import time
start = time.perf_counter()
s = ''
for i in range(n):
s += 'a'
total_time = time.perf_counter() - start
time_per_iteration = total_time / n
For example, on my machine (Windows 10, python 3.6.1):
- for
10 ** 4 < n < 10 ** 6
, thetime_per_iteration
is almost perfectly constant at 170±10 µs - for
10 ** 6 < n
, thetime_per_iteration
is almost perfectly linear, reaching 520 µs atn == 10 ** 7
.
Linear growth in time_per_iteration
is equivalent to quadratic growth in total_time
.
The linear complexity results from the optimization in the more recent CPython versions (2.4+) that reuse the original storage if no references remain to the original object. But I expected the linear performance to continue indefinitely rather than switch to quadratic at some point.
My question is based made on this comment. For some odd reason running
python -m timeit -s"s=''" "for i in range(10**7):s+='a'"
takes incredibly long time (much longer than quadratic), so I never got the actual timing results from timeit
. So instead, I used a simple loop as above to obtain performance numbers.
Update:
My question might as well have been titled "How can a list-like append
have O(1)
performance without over-allocation?". From observing constant time_per_iteration
on small-size strings, I assumed the string optimization must be over-allocating. But realloc
is (unexpectedly to me) quite successful at avoiding memory copy when extending small memory blocks.
TL;DR: Just because string concatenation is optimized under certain circumstances doesn't mean it's necessarily
O(1)
, it's just not alwaysO(n)
. What determines the performance is ultimatly your system and it could be smart (beware!). Lists that "garantuee" amortizedO(1)
append operations are still much faster and better at avoiding reallocations.This is an extremly complicated problem, because it's hard to "measure quantitativly". If you read the announcement:
If you take a closer look at it then you'll note that it mentions "certain circumstances". The tricky thing is to find out what these certain cirumstances are. One is immediatly obvious:
Otherwise it wouldn't be safe to change
s
.But another condition is:
O(1)
- that means without needing to copy the contents of the string to a new location.That's were it get's tricky. Because the system is responsible for doing a reallocation. That's nothing you can control from within python. However your system is smart. That means in many cases you can actually do the reallocation without needing to copy the contents. You might want to take a look at @TimPeters answer, that explains some of it in more details.
I'll approach this problem from an experimentalists point of view.
You can easily check how many reallocations actually need a copy by checking how often the ID changes (because the
id
function in CPython returns the memory adress):This gives a different number each run (or almost each run). It's somewhere around 500 on my computer. Even for
range(10000000)
it's just 5000 on my computer.But if you think that's really good at "avoiding" copies you're wrong. If you compare it to the number of resizes a
list
needs (list
s over-allocate intentionally soappend
is amortizedO(1)
):That only needs 105 reallocations (always).
I mentioned that
realloc
could be smart and I intentionally kept the "sizes" when the reallocs happened in a list. Many C allocators try to avoid memory fragmentation and at least on my computer the allocator does something different depending on the current size:Note that the x-axis represents the number of "copies done" not the size of the string!
That's graph was actually very interesting for me, because it shows clear patterns: For small arrays (up to 465 elements) the steps are constant. It needs to reallocate for every 8 elements added. Then it needs to actually allocate a new array for every character added and then at roughly 940 all bets are off until (roughly) one million elements. Then it seems it allocates in blocks of 4096 bytes.
My guess is that the C allocator uses different allocation schemes for differently sized objects. Small objects are allocated in blocks of 8 bytes, then for bigger-than-small-but-still-small arrays it stops to overallocate and then for medium sized arrays it probably positions them where they "may fit". Then for huge (comparativly speaking) arrays it allocates in blocks of 4096 bytes.
I guess the 8byte and 4096 bytes aren't random. 8 bytes is the size of an
int64
(orfloat64
akadouble
) and I'm on a 64bit computer with python compiled for 64bits. And 4096 is the page size of my computer. I assume there are lots of "objects" that need have these sizes so it makes sense that the compiler uses these sizes because it could avoid memory fragmentation.You probably know but just to make sure: For
O(1)
(amortized) append behaviour the overallocation must depend on the size. If the overallocation is constant it will beO(n**2)
(the greater the overallocation the smaller the constant factor but it's still quadratic).So on my computer the runtime behaviour will be always
O(n**2)
except for lengths (roughly) 1 000 to 1 000 000 - there it really seems to undefined. In my test run it was able to add many (ten-)thousand elements without ever needing a copy so it would probably "look likeO(1)
" when timed.Note that this is just my system. It could look totally different on another computer or even with another compiler on my computer. Don't take these too seriously. I provided the code to do the plots yourself, so you can analyze your system yourself.
You also asked the question (in the comments) if there would be downsides if you over-allocate strings. That's really easy: Strings are immutable. So any overallocated byte is wasting ressources. There are only a few limited cases where it really does grow and these are considered implementation details. The developers probably don't throw away space to make implementation details perform better, some python developers also think that adding this optimization was a bad idea.
In the end, the platform C allocators (like
malloc()
) are the ultimate source of memory. When CPython tries to reallocate string space to extend its size, it's really the system Crealloc()
that determines the details of what happens. If the string is "short" to begin with, chances are decent the system allocator finds unused memory adjacent to it, so extending the size is just a matter of the C library's allocator updating some pointers. But after repeating this some number of times (depending again on details of the platform C allocator), it will run out of space. At that point,realloc()
will need to copy the entire string so far to a brand new larger block of free memory. That's the source of quadratic-time behavior.Note, e.g., that growing a Python list faces the same tradeoffs. However, lists are designed to be grown, so CPython deliberately asks for more memory than is actually needed at the time. The amount of this overallocation scales up as the list grows, enough to make it rare that
realloc()
needs to copy the whole list-so-far. But the string optimizations do not overallocate, making cases whererealloc()
needs to copy far more frequent.When growing a contiguous array data structure (illustrated above) through appending to it, linear performance can be achieved if the extra space reserved while reallocating the array is proportional to the current size of the array. Obviously, for large strings this strategy is not followed, most probably with the purpose of not wasting too much memory. Instead a fixed amount of extra space is reserved during each reallocation, resulting in quadratic time complexity. To understand where the quadratic performance comes from in the latter case, imagine that no overallocation is performed at all (which is the boundary case of that strategy). Then at each iteration a reallocation (requiring linear time) must be performed, and the full runtime is quadratic.