Assume that we have a list of strings and we want to create a string by concatenating all element in this list. Something like this:
def foo(str_lst):
result = ''
for element in str_lst:
result += element
return result
Since strings are immutable objects, I expect that python creates a new str object and copy contents of result and element at each iteration. It makes O(M * N^2)
time complexity, M is the length of each element and N is the size of the list.
However, my experiment shows that it runs in linear time.
N = 1000000 # 1 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 0.5 seconds
N = 2000000 # 2 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 1.0 seconds
N = 10000000 # 10 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 5.3 seconds
I suspect that python uses something like stringbuffer under the hood. So, it doesn't create new object at each iteration.
Now consider a slightly different implementation. The only difference is one extra assignment.
def foo2(str_lst):
result = ''
for element in str_lst:
result += element
temp = result # new added line
return result
I know that temp = result
line does not create a new object. temp
just points to the same object. So, this little change shouldn't affect the performance much.
N = 1000000 # 1 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 0.5 seconds
foo2(str_lst) # It takes around 30 seconds
N = 2000000 # 2 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 1 seconds
foo2(str_lst) # It takes around 129 seconds
However, there is a huge difference. And it looks like foo2 function is O(N^2) while foo is O(N).
My question is how does python achieve linear time in string concatenation without breaking other language components like immutable object assignment? and how does that extra line affect the performance that much? I searched a bit in the cpython implementation but couldn't find exact location.
Update
Here is the line profiling results.
result for foo function
Total time: 0.545577 s
File: <ipython-input-38-b9bb169e8fe0>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo(str_lst):
2 1 2.0 2.0 0.0 result = ''
3 1000001 238820.0 0.2 43.8 for element in str_lst:
4 1000000 306755.0 0.3 56.2 result += element
5 1 0.0 0.0 0.0 return result
result for foo2 function
Total time: 30.6663 s
File: <ipython-input-40-34dd53670dd9>
Function: foo2 at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo2(str_lst):
2 1 2.0 2.0 0.0 result = ''
3 1000001 299122.0 0.3 1.0 for element in str_lst:
4 1000000 30033127.0 30.0 97.7 result += element
5 1000000 413283.0 0.4 1.3 temp = result
6 1 0.0 0.0 0.0 return result
Somehow temp = result
line affects the performance of result += element
line.
Actually, the behavior you are observing is determined by the behavior of the memory-allocator of the C-runtime on your OS.
CPython has an optimization, that if the unicode-object has only one reference, it can be changed in-place - nobody would register that the unicode-object loss its immutability for a moment. See my answer to this SO-question for more details.
In
foo2
, there is another reference to the unicode object (temp
), which prevents the in-place-optimization: Changing it in-place would break the immutability, because it could be observed throughtemp
.However, even with the inplace optimization, it is not obvious, why
O(n^2)
behavior can be avoided, as unicode object doesn't overallocate and thus has to exend the underlying buffer at every addition, which naively would mean a copy of the whole content (i.e.O(n)
) in every step.However, most of the time
realloc
(differently thanmalloc
+copy) can be done inO(1)
, because if the memory directly behind the the allocated buffer is free, it can be used to extend the original without copying.An interesting detail is that there is no guarantee, that
foo
will run inO(n)
: If the memory is fragemented (e.g. in a long running process).realloc
wont be able to extend the buffer without copying the data and thus the running time will becomeO(n^2)
.Thus one should not rely on this optimization to avoid quadratic running time.
Having another name pointing to the same object kills the optimisation. The optimisation basically works by resizing the string object and appending in place. If you have more than one references to that object, you can't resize without affecting the other reference. With strings being immutable, allowing this would be a serious flaw of the implementation.
increased the reference count for the string object named by
result
thereby prohibiting the optimisation.The full list of checks performed in the case of
+=
(which eventually translates toPyUnicode_Append
) can be seen in theunicode_modifiable
function. Among other things, it checks that the reference count of the object is equal to one, that it isn't interned and that it isn't a string subclass.There's a couple more checks in the
if
statement guarding this optimisation, if you want a more thorough list.Though not the basic issue of your question, future readers might be curious about how to efficiently perform string concatenations. Besides similar questions on S.O, the Python FAQ also has an entry on this.