Our code takes 10 minutes to siphon through 68,000 records when we use:

```python
new_file = new_file + line + string
```

However, when we do the following it takes just 1 second:

```python
new_file += line + string
```

Here is the code:
```python
import time
import cmdbre

fname = "STAGE050.csv"
regions = cmdbre.regions
start_time = time.time()

with open(fname) as f:
    content = f.readlines()

new_file_content = ""
new_file = open("CMDB_STAGE060.csv", "w")
row_region = ""
i = 0
for line in content:
    if i == 0:
        new_file_content = line.strip() + "~region" + "\n"
    else:
        country = line.split("~")[13]
        try:
            row_region = regions[country]
        except KeyError:
            row_region = "Undetermined"
        new_file_content += line.strip() + "~" + row_region + "\n"
        print(row_region)
    i = i + 1

new_file.write(new_file_content)
new_file.close()
end_time = time.time()
print("total time: " + str(end_time - start_time))
```
All code I've ever written in Python uses the first option. This is just basic string operations... we are reading input from a file, processing it, and outputting it to the new file. I am 100% certain that the first method takes roughly 600 times longer to run than the second, but why?

The file being processed is a CSV but uses `~` instead of a comma. All we are doing here is taking this CSV, which has a column for country, and adding a column for the country's region, e.g. LAC, EMEA, NA, etc. `cmdbre.regions` is just a dictionary, with all ~200 countries as keys and each region as the value.
Once I changed to the append string operation... the loop completed in 1 second instead of 10 minutes... 68,000 records in the csv.
Actually, both could be equally slow, but for an optimization that is an implementation detail of the official Python runtime (CPython).

Strings in Python are immutable; that means that when you do `str1 + str2`, Python has to create a third string object and copy over all the contents from `str1` and from `str2` to it, no matter how large either part is.

The in-place operator allows Python to use some internal optimizations so that all the data in `str1` is not necessarily copied over again, and it probably even allows some extra buffer space for further concatenation operations.
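A tiny runnable illustration of that immutability (the values here are arbitrary):

```python
s1 = "hello"
s2 = "world"
s3 = s1 + s2              # a brand-new object; s1 and s2 are untouched
assert s3 == "helloworld"
assert s1 == "hello"

a = "abc"
b = a                     # two names, one immutable string
a += "def"                # rebinds a; the object b refers to never changes
assert a == "abcdef"
assert b == "abc"
```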
Once one gets a feeling for how the language works, the way to build a large text body from small strings is to create a Python list with all the strings and, after the loop is over, make a single call to the `str.join` method, passing in all the string components. That will be consistently fast, even across Python implementations, and does not depend on any optimization being triggered.

CPython (the reference interpreter) has an optimization for in-place string concatenation (when the string being appended to has no other references). It can't apply this optimization as reliably when doing `+`, only `+=` (`+` involves two live references, the assignment target and the operand, and the former isn't involved in the `+` operation, so it's harder to optimize it). You shouldn't rely on this though; per PEP 8, code should not depend on CPython's efficient implementation of in-place string concatenation, and the `''.join()` form should be used instead.
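The list-and-join pattern looks like this (a minimal runnable sketch with made-up items):

```python
parts = []
for i in range(5):
    parts.append("item%d" % i)    # cheap, amortized O(1) list appends
result = ", ".join(parts)          # one pass builds the final string
assert result == "item0, item1, item2, item3, item4"
```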
Update based on question edits: Yeah, you broke the optimization. You concatenated many strings, not just one, and Python evaluates left-to-right, so it must do the left-most concatenation first. Thus `new_file_content += line.strip() + "~" + row_region + "\n"` is not at all the same as `new_file_content = new_file_content + line.strip() + "~" + row_region + "\n"`, because the former concatenates all the new pieces together, then appends them to the accumulator string all at once, while the latter must evaluate each addition from left to right with temporaries that don't involve `new_file_content` itself. Adding parens for clarity, it's like you did `new_file_content = (((new_file_content + line.strip()) + "~") + row_region) + "\n"`. Because Python doesn't actually know the types until it reaches them, it can't assume all of those are strings, so the optimization doesn't kick in.
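To see the left-to-right effect, here is a hedged timing sketch comparing the two loop bodies (piece count and sizes are arbitrary; absolute times vary by machine and interpreter, so only the results, not the timings, are checked):

```python
from timeit import timeit

PIECES = ["row%05d" % i for i in range(10_000)]  # stand-in record data

def chained_plus():
    s = ""
    for p in PIECES:
        s = s + p + "\n"   # left-most concatenation runs first; no in-place path
    return s

def augmented():
    s = ""
    for p in PIECES:
        s += p + "\n"      # right side built first, then one in-place append
    return s

assert chained_plus() == augmented()
print("s = s + ... :", timeit(chained_plus, number=1))
print("s += ...    :", timeit(augmented, number=1))
```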
If you changed the second bit of code to `new_file_content = new_file_content + (line.strip() + "~" + row_region + "\n")`, or, slightly slower but still many times faster than your slow code because it keeps the CPython optimization, split each piece into its own simple concatenation (`new_file_content = new_file_content + line.strip()`, then `new_file_content = new_file_content + "~"`, and so on) so the accumulation was obvious to CPython, you'd fix the performance problem. But frankly, you should just be using `+=` anytime you're performing a logical append operation like this; `+=` exists for a reason, and it provides useful information to both the maintainer and the interpreter. Beyond that, it's good practice as far as DRY goes; why name the variable twice when you don't need to?

Of course, per PEP 8 guidelines, even using `+=` here is bad form. In most languages with immutable strings (including most non-CPython Python interpreters), repeated string concatenation is a form of Schlemiel the Painter's Algorithm, which causes serious performance problems. The correct solution is to build a `list` of strings, then `join` them all at once. That is usually faster even when the CPython string concatenation optimizations are available, and it will be reliably fast on non-CPython interpreters as well, because it uses a mutable `list` to accumulate results efficiently, then allows `''.join` to precompute the total length of the string, allocate the final string all at once (instead of incremental resizes along the way), and populate it exactly once.

Side-note: for your specific case, you shouldn't be accumulating or concatenating at all. You've got an input file and an output file, and can process line by line. Every time you would append or accumulate file contents, just write them out instead.
Implementation details deep dive
For those curious about implementation details, the CPython string concat optimization is implemented in the bytecode interpreter, not on the `str` type itself (technically, `PyUnicode_Append` does the mutation optimization, but it requires help from the interpreter to fix up reference counts so it knows it can use the optimization safely; without interpreter help, only C extension modules would ever benefit from that optimization).

When the interpreter detects that both operands are the Python-level `str` type (at the C layer, in Python 3, it's still referred to as `PyUnicode`, a legacy of 2.x days that wasn't worth changing), it calls a special `unicode_concatenate` function, which checks whether the very next instruction is one of three basic `STORE_*` instructions. If it is, and the target is the same as the left operand, it clears the target reference so `PyUnicode_Append` will see only a single reference to the operand, allowing it to invoke the optimized code for a `str` with a single reference.

This means that not only can you break the optimization by concatenating more than two strings in a single statement (as shown above), you can also break it any time the variable in question is not a top-level (global, nested, or local) name. If you're operating on an object attribute, a `list` index, a `dict` value, etc., even `+=` won't help you: the interpreter won't see a "simple `STORE`", so it doesn't clear the target reference, and all of these get the ultraslow, not-in-place behavior: `foo.x += mystr`, `foo[0] += mystr`, `foo['x'] += mystr`.

It's also specific to the `str` type; in Python 2, the optimization doesn't help with `unicode` objects, and in Python 3, it doesn't help with `bytes` objects, and in neither version will it optimize for subclasses of `str`; those always take the slow path.

Basically, the optimization is there to be as nice as possible in the simplest common cases for people new to Python, but it's not going to go to serious trouble for even moderately more complex cases. This just reinforces the PEP 8 recommendation: depending on implementation details of your interpreter is a bad idea when you could run faster on every interpreter, for any store target, by doing the right thing and using `str.join`.
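The store-target restriction can be observed with a quick sketch (timings are machine-dependent and merely printed, not asserted; the class and sizes are arbitrary):

```python
from timeit import timeit

N = 5_000
CHUNK = "x" * 20

def local_target():
    s = ""
    for _ in range(N):
        s += CHUNK          # simple local STORE: in-place path can fire
    return s

class Box:
    pass

def attribute_target():
    box = Box()
    box.s = ""
    for _ in range(N):
        box.s += CHUNK      # STORE_ATTR target: optimization never fires
    return box.s

assert local_target() == attribute_target()
print("local name :", timeit(local_target, number=1))
print("attribute  :", timeit(attribute_target, number=1))
```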