Python string concatenation internal details

2020-06-17 05:53发布

Assume that we have a list of strings and we want to create a string by concatenating all element in this list. Something like this:

def foo(str_lst):
    result = ''
    for element in str_lst:
        result += element
    return result

Since strings are immutable objects, I expect that python creates a new str object and copy contents of result and element at each iteration. It makes O(M * N^2) time complexity, M is the length of each element and N is the size of the list.

However, my experiment shows that it runs in linear time.

N = 1000000 # 1 million
str_lst = ['a' for _ in range(N)]

foo(str_lst) # It takes around 0.5 seconds

N = 2000000 # 2 million
str_lst = ['a' for _ in range(N)]

foo(str_lst) # It takes around 1.0 seconds

N = 10000000 # 10 million
str_lst = ['a' for _ in range(N)]

foo(str_lst) # It takes around 5.3 seconds

I suspect that python uses something like stringbuffer under the hood. So, it doesn't create new object at each iteration.

Now consider a slightly different implementation. The only difference is one extra assignment.

def foo2(str_lst):
    result = ''
    for element in str_lst:
        result += element
        temp = result # new added line
    return result

I know that temp = result line does not create a new object. temp just points to the same object. So, this little change shouldn't affect the performance much.

N = 1000000 # 1 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 0.5 seconds
foo2(str_lst) # It takes around 30 seconds

N = 2000000 # 2 million
str_lst = ['a' for _ in range(N)]
foo(str_lst) # It takes around 1 seconds
foo2(str_lst) # It takes around 129 seconds

However, there is a huge difference. And it looks like foo2 function is O(N^2) while foo is O(N).

My question is how does python achieve linear time in string concatenation without breaking other language components like immutable object assignment? and how does that extra line affect the performance that much? I searched a bit in the cpython implementation but couldn't find exact location.

Update

Here is the line profiling results.

result for foo function

Total time: 0.545577 s
File: <ipython-input-38-b9bb169e8fe0>
Function: foo at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def foo(str_lst):
 2         1          2.0      2.0      0.0      result = ''
 3   1000001     238820.0      0.2     43.8      for element in str_lst:
 4   1000000     306755.0      0.3     56.2          result += element
 5         1          0.0      0.0      0.0      return result

result for foo2 function

Total time: 30.6663 s
File: <ipython-input-40-34dd53670dd9>
Function: foo2 at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 1                                           def foo2(str_lst):
 2         1          2.0      2.0      0.0      result = ''
 3   1000001     299122.0      0.3      1.0      for element in str_lst:
 4   1000000   30033127.0     30.0     97.7          result += element
 5   1000000     413283.0      0.4      1.3          temp = result
 6         1          0.0      0.0      0.0      return result

Somehow temp = result line affects the performance of result += element line.

2条回答
狗以群分
2楼-- · 2020-06-17 06:38

Actually, the behavior you are observing is determined by the behavior of the memory-allocator of the C-runtime on your OS.

CPython has an optimization, that if the unicode-object has only one reference, it can be changed in-place - nobody would register that the unicode-object loss its immutability for a moment. See my answer to this SO-question for more details.

In foo2, there is another reference to the unicode object (temp), which prevents the in-place-optimization: Changing it in-place would break the immutability, because it could be observed through temp.

However, even with the inplace optimization, it is not obvious, why O(n^2) behavior can be avoided, as unicode object doesn't overallocate and thus has to exend the underlying buffer at every addition, which naively would mean a copy of the whole content (i.e. O(n)) in every step.

However, most of the time realloc (differently than malloc+copy) can be done in O(1), because if the memory directly behind the the allocated buffer is free, it can be used to extend the original without copying.


An interesting detail is that there is no guarantee, that foo will run in O(n): If the memory is fragemented (e.g. in a long running process). realloc wont be able to extend the buffer without copying the data and thus the running time will become O(n^2).

Thus one should not rely on this optimization to avoid quadratic running time.

查看更多
女痞
3楼-- · 2020-06-17 06:48

Having another name pointing to the same object kills the optimisation. The optimisation basically works by resizing the string object and appending in place. If you have more than one references to that object, you can't resize without affecting the other reference. With strings being immutable, allowing this would be a serious flaw of the implementation.

temp = result

increased the reference count for the string object named by result thereby prohibiting the optimisation.

The full list of checks performed in the case of += (which eventually translates to PyUnicode_Append) can be seen in the unicode_modifiable function. Among other things, it checks that the reference count of the object is equal to one, that it isn't interned and that it isn't a string subclass.

There's a couple more checks in the if statement guarding this optimisation, if you want a more thorough list.


Though not the basic issue of your question, future readers might be curious about how to efficiently perform string concatenations. Besides similar questions on S.O, the Python FAQ also has an entry on this.

查看更多
登录 后发表回答