Solving a recent Advent of Code problem, I found my default Python solution was ~40x slower than PyPy. I got that down to about 17x with the code below, by limiting calls to `len` and by running it inside a function to limit global lookups.
Right now, `e.py` runs in 5.162 seconds on Python 3.6.3 and 0.297 seconds on PyPy on my machine.
My question is: is this simply the irreducible speedup of a JIT, or is there some way to make the CPython version faster? (Short of extreme measures: I know I could go to Cython/Numba or something.) How do I convince myself that there's nothing more I can do?
See the gist for the list of numbers input file.
As described in the problem statement, they represent jump offsets: `position += offsets[current]`, and increment the current offset by 1. You're done when a jump takes you outside the list.
Here's the example given (the full input that takes 5 seconds is much longer, and has larger numbers):
- `(0) 3  0  1  -3` - before we have taken any steps.
- `(1) 3  0  1  -3` - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
- ` 2 (3) 0  1  -3` - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
- ` 2  4  0  1 (-3)` - jump all the way to the end; leave a 4 behind.
- ` 2 (4) 0  1  -2` - go back to where we just were; increment -3 to -2.
- ` 2  5  0  1  -2` - jump 4 steps forward, escaping the maze.
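For reference, the walkthrough above can be reproduced with a minimal simulation of its simple always-increment rule (the code further down uses the modified `>= 3` rule; variable names here are my own):

```python
def steps_to_escape(offsets):
    """Count jumps until the position leaves the list (always-increment rule)."""
    offsets = list(offsets)  # work on a copy; the rule mutates the offsets
    position = steps = 0
    while 0 <= position < len(offsets):
        jump = offsets[position]
        offsets[position] += 1  # increment the offset we just followed
        position += jump
        steps += 1
    return steps

print(steps_to_escape([0, 3, 0, 1, -3]))  # escapes after 5 steps
```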
The code:

```python
def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            counter += 1
            if location < 0:  # a jump off the front of the list also escapes
                print(counter)
                break
        except IndexError:  # jumped past the end of the list
            print(counter)
            break

if __name__ == "__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)
```
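One more CPython-level variant that might be worth timing (a sketch, not benchmarked here; the names are mine): replace the try/except with an explicit bounds check and store the offsets in an `array.array` of C ints. Whether this beats the exception-based exit is not obvious, since CPython's try blocks cost almost nothing when no exception is raised:

```python
from array import array

def run_bounded(cmds):
    """Same jump rules as run(), but with an explicit bounds check."""
    cmds = array('i', cmds)  # compact C ints instead of a list of Python ints
    n = len(cmds)            # cache len() once; the length never changes
    location = 0
    counter = 0
    while 0 <= location < n:
        cmd = cmds[location]
        cmds[location] = cmd - 1 if cmd >= 3 else cmd + 1
        location += cmd
        counter += 1
    return counter

print(run_bounded([0, 3, 0, 1, -3]))  # 10 steps with the >= 3 rule
```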
edit: I compiled and ran the code with Cython; that dropped the runtime to 2.53s, but that's still almost an order of magnitude slower than PyPy.

edit: Numba gets me to within 2x of PyPy.

edit: The best Cython I could write got down to 1.32s, a little over 4x PyPy.

edit: Moving the `cmd` variable into a `cdef`, as suggested by @viraptor, got the Cython down to 0.157 seconds! Nearly a full order of magnitude faster, and not that far from regular C. Still, I come away from this extremely impressed with the PyPy JIT, which did all this for free!
As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython 3.6.2 running the code in the question, on a Skylake x86, so even the faster Python versions are still a couple of factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job.)
Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an `add` from the critical-path loop-carried dependency chain, bringing it down to just the 4-cycle L1D load-use latency for a simple addressing mode, not 6c = 5c + 1c). But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585-element array was still only 18kiB, so it fits in the 32kiB L1D cache.)

Also, an alternate expression for the update gets gcc to compile it branchless as well. (Commented out, and the original `perf stat` output left in this answer, because it's interesting to see the effect of branchless vs. branchy with mispredicts.)

With clang 4.0.1 `-O3 -march=skylake`, the inner loop is branchless; it uses a conditional move for the `>= 3` condition. I used `? :` because that's what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer.

Here's the output of `perf stat ./a.out < input.txt` (for the clang version), on my i7-6700k Skylake CPU:

The average instructions-per-clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load+add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.
Changing from `int` to `short` has no performance downside (as expected; `movsx` has the same latency as `mov` on Skylake), but halves the memory consumption, so a larger array could fit in L1D cache if needed.

I tried compiling the array into the program (as `int offsets[] = { file contents with commas added };`) so it doesn't have to be read and parsed. It also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not on parsing the input.

As described earlier, pointer-chasing with a simple addressing mode like `[rdi]` instead of `[rdi + rdx*4]` has 1c lower latency, and avoids the `add` (`index += offset` becomes `current = target`). Intel since IvyBridge has zero-latency integer `mov` between registers, so that doesn't lengthen the critical path. Here's the source (with comments) + asm for this hacky optimization. A typical run (with text parsing into a `std::vector`): `23.26 +- 0.05 ms`, `90.725 M cycles` (3.900 GHz), `288.724 M instructions` (3.18 IPC). Interestingly it's actually more total instructions, but it runs much faster due to the lower latency of the loop-carried dependency chain.

gcc uses a branch and is about 2x slower. (14% of branches are mispredicted according to `perf stat` on the whole program. It's only branching as part of updating the values, but branch misses stall the pipeline, so they affect the critical path too, in a way that data dependencies don't here. This seems like a poor decision by the optimizer.) Rewriting the conditional as `offset[location] = (off >= 3) ? off - 1 : off + 1;` convinces gcc to go branchless, with asm that looks good.

gcc 7.1.1 `-O3 -march=skylake` (for the original source that compiles with a branch, `(off >= 3) ? -1 : +1`):

vs. CPython (Python 3.6.2 on Arch Linux):
I don't have PyPy or other Python implementations installed, sorry.
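The pointer-chasing transformation the answer describes can be sketched portably by storing absolute target indices instead of relative offsets (my own approximation; the answer's actual version uses 32-bit pointers):

```cpp
#include <vector>

// current = target[current] replaces index += offsets[index]: the add
// drops off the loop-carried dependency chain, leaving only the load.
long run_targets(const std::vector<int>& offsets) {
    const long n = (long)offsets.size();
    std::vector<long> target(n);
    for (long i = 0; i < n; ++i)
        target[i] = i + offsets[i];  // absolute destination of each cell

    long counter = 0;
    long pos = 0;
    while (pos >= 0 && pos < n) {
        long t = target[pos];
        long off = t - pos;  // recover the relative offset for the >= 3 rule
        target[pos] = (off >= 3) ? t - 1 : t + 1;  // the +/-1 update is local
        pos = t;  // a simple load; no add on the critical path
        ++counter;
    }
    return counter;
}
```

The subtraction to recover `off` adds an instruction, but it is off the critical path, matching the answer's observation that more total instructions can still run faster.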
You can improve small things, but PyPy will (most likely) always be faster at this task.
For both CPython and Cython:
For Cython: type your variables as `int`s and the commands as an array of `int`s to skip most type checks.