I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement the same brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with
g++ p14.cpp -o p14
Assembly, p14.asm
section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret
C++, p14.cpp
#include <iostream>
using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    cout << maxi << endl;
}
I know about compiler optimizations that improve speed and everything, but I don't see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code has a modulus every term and a division every even term, whereas the assembly has only one division per even term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64 bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms
g++ -O3: avg 578 ms
original asm (div): avg 2650 ms
asm (shr): avg 679 ms
@johnfound asm, assembled with nasm: avg 501 ms
@hidefromkgb asm: avg 200 ms
@hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms
@Veedrac C++: avg 81 ms with -O3, 305 ms with -O0
The simple answer:

- doing a MOV RBX, 3 and MUL RBX is expensive; a copy plus two ADDs (or a single LEA) computes 3n much more cheaply
- ADD 1 is probably faster than INC here
- MOV 2 and DIV is very expensive; just shift right
- 64-bit code is usually noticeably slower than 32-bit code, and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.
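For example (this writes the listing to p14.s; add -O3 to see the optimized version):

g++ -S p14.cpp -o p14.s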
Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>= 1, and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.

Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup this thread's question asks about, then the compiler optimizing /= 2 as >>= 1 would be the best first place to look.

On an unrelated note: if n is odd, the expression n*3 + 1 will always be even, so there is no need to check before the next division by two. You can change that branch to divide immediately and count two steps at once; the whole function would then look like this (a sketch of the change):
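int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0) {
            n /= 2;               // even step
            ++count;
        } else {
            n = (n*3 + 1) / 2;    // odd step plus the guaranteed even step
            count += 2;
        }
    }
    return count;
}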
On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by @ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):

1. [Even] [Odd]
2. [Odd] [Even]
3. [Even] [Even]

To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.

Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.

Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most- to least-significant, then a = 1, and the following holds:

    N = ba = 2b + 1
    (N << 1) + N + 1 = 3N + 1 = 6b + 4 = bBb0 (binary)
    (N >> 1) + N + 1 = b + (2b + 1) + 1 = 3b + 2 = bBb  (binary)

where B = !b. Right-shifting the first result gives us exactly what we want.

Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.

As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
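A quick brute-force check of the identity, for the skeptical (an illustrative snippet, not part of the proof):

#include <cassert>

int main() {
    // for every odd N: (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1
    for (unsigned long long N = 1; N < 1000000; N += 2)
        assert((N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1);
}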
The resulting algorithm looks like this (a C++ sketch; identifiers are illustrative):
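int sequence(unsigned long long n) {
    int count = 1;                             // count the starting element
    while (n > 2) {
        // cases (1) and (2) use the unified formula; case (3) is N >> 2
        n = (n & 3) ? (n >> 1) + n + 1 : n >> 2;
        count += 2;                            // two elements per leap
    }
    return count + (n == 2);                   // odd-length sequences stop at 2
}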
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

[EDIT:]

Let's translate this into assembly!
Use these commands to compile:
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)
Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake, especially in this case. A human can always make the code better than the compiler can, and this particular situation is a good illustration of that claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions: increment the counter, compute 3n+1 with a single LEA, halve n with SHR (which leaves the old low bit in the carry flag), conditionally replace n with 3n+1 via CMOVC, and close the loop with JNZ.
The whole code looks like:
In order to compile this code, FreshLib is needed.
In my tests (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question when it is compiled with -O0 (430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

The output of both programs is the same: max sequence = 525 on i = 837799.
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
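    test rax, 1
    jpe even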
... has a 50% chance of mispredicting the branch, and that will get expensive.

The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV, which, of course, has a zero percent chance of being mispredicted.
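As an illustration, here is a hypothetical C++ rendering of that branchless shape (a sketch, not the compiler's actual output):

long collatz_step(long n) {
    // Compute both successors; the div/mod latency dominates,
    // so the extra multiply-add is effectively free.
    long even_next = n / 2;
    long odd_next  = 3*n + 1;
    // The ternary typically compiles to TEST + CMOV: nothing to mispredict.
    return (n % 2 == 0) ? even_next : odd_next;
}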
For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example implementation (the original snippet was in Python; here is a C++ sketch of the same idea, with illustrative names and cache bound):
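#include <iostream>
#include <vector>

// Cache the length of every chain whose starting value is below the bound
// ("caching the tails"). The names and the bound are illustrative choices.
static const unsigned long long CACHE_SIZE = 1000000;
static std::vector<int> cache(CACHE_SIZE, 0);   // 0 means "not computed yet"

static int sequence(unsigned long long n) {
    if (n < CACHE_SIZE && cache[n] != 0)
        return cache[n];                        // reuse a cached tail
    int count;
    if (n == 1)
        count = 1;
    else if (n % 2 == 0)
        count = 1 + sequence(n / 2);
    else
        count = 1 + sequence(3*n + 1);
    if (n < CACHE_SIZE)
        cache[n] = count;                       // remember this tail
    return count;
}

int main() {
    int max = 0;
    unsigned long long maxi = 0;
    for (unsigned long long i = 1; i < 1000000; ++i) {
        int s = sequence(i);
        if (s > max) { max = s; maxi = i; }
    }
    std::cout << maxi << std::endl;             // expect 837799
}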