I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement the same brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with
g++ p14.cpp -o p14
Assembly, p14.asm
section .data
    fmt db "%d", 10, 0
    global main
    extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret
C++, p14.cpp
#include <iostream>
using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    cout << maxi << endl;
}
I know about compiler optimizations that improve speed and everything, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).
The C++ code computes a modulus every term and a division every even term, whereas the assembly only does one division per even term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64 bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms
g++ -O3: avg 578 ms
original asm (div): avg 2650 ms
asm (shr): avg 679 ms
@johnfound asm, assembled with NASM: avg 501 ms
@hidefromkgb asm: avg 200 ms
@hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms
@Veedrac C++: avg 81 ms with -O3, 305 ms with -O0
For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
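The original snippet isn't preserved here; a minimal C++ sketch of that idea (function name and counting convention are mine) might be:

// Count Collatz operations, exploiting that 3n+1 is always even, so it can be
// followed by an unconditional /2 and the n == 1 test can be skipped on that path.
unsigned long long steps(unsigned long long n) {
    unsigned long long count = 0;
    while (n % 2 == 0) { n /= 2; ++count; }        // make n odd first
    while (n != 1) {
        n = (3 * n + 1) / 2;                       // 3n+1 is even: two steps at once
        count += 2;
        while (n % 2 == 0) { n /= 2; ++count; }    // strip remaining factors of 2
    }
    return count;
}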
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you have divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is, in binary your number is ???? 0000 0001, then the next steps are:
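(The original step listing isn't preserved here; as an illustration, write n = 256k + 1 and apply the rules:)

3n+1:  768k + 4
n/2:   384k + 2
n/2:   192k + 1
3n+1:  576k + 4
n/2:   288k + 2
n/2:   144k + 1
3n+1:  432k + 4
n/2:   216k + 2
n/2:   108k + 1
3n+1:  324k + 4
n/2:   162k + 2
n/2:    81k + 1

That is exactly eight divisions by 2 and four 3n+1 steps, regardless of what k is.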
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
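The answer's switch isn't reproduced here; a hedged C++ sketch of its shape (only two cases written out, with a generic fallback so the sketch stays runnable; the n > 128 guard is explained just below) could be:

#include <cstdint>

// Apply Collatz steps until we have divided by 2 exactly eight times; the low
// 8 bits of n completely determine what happens to the rest of the number.
static uint64_t eight_halvings(uint64_t n) {
    for (int halved = 0; halved < 8; ) {
        if (n % 2 == 0) { n /= 2; ++halved; }
        else            { n = 3 * n + 1; }
    }
    return n;
}

uint64_t collatz_bytewise(uint64_t n) {
    while (n > 128) {
        uint64_t k = n >> 8;                    // the "????" upper part
        switch (n & 0xFF) {                     // low byte selects a precomputed update
            case 0x01: n = 81 * k + 1; break;   // derived in the text above
            case 0x03: n = 81 * k + 2; break;   // worked out the same way
            // ... the real thing has one case for each of the 256 byte values ...
            default:   n = eight_halvings(n); break;
        }
    }
    return n;
}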
Run the loop only while n > 128, because below that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are needed to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
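The code for that isn't preserved here either; a hedged, self-contained C++ sketch (table names and the init step are mine) might look like:

#include <cstdint>

// Tables indexed by the low 8 bits of n: after exactly eight /2 steps,
// n == 256*k + b becomes mult[b]*k + add_[b]. Filling them is a small
// one-time precomputation with the generic step rule.
static uint64_t mult[256], add_[256];

static void init_tables() {
    for (uint64_t b = 0; b < 256; ++b) {
        uint64_t m = 256, a = b;                            // n = m*k + a
        for (int halved = 0; halved < 8; ) {
            if (a % 2 == 0) { m /= 2; a /= 2; ++halved; }   // m stays even until the 8th halving
            else            { m *= 3; a = 3 * a + 1; }
        }
        mult[b] = m;
        add_[b] = a;
    }
}

uint64_t collatz_table(uint64_t n) {
    while (n > 128) {
        uint64_t b = n & 0xFF;
        n = (n >> 8) * mult[b] + add_[b];                   // no branch except the loop condition
    }
    return n;
}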
In practice, you would measure whether processing the last 9, 10, 11, or 12 bits of n at a time would be faster. For each extra bit, the number of entries in the table doubles, and I expect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be almost exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/(3n)). Taking the logarithm we find r = (s + log2(n' / n)) / log2(3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table of how many iterations are needed from any start point n ≤ 1,000,000, then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
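A hedged C++ sketch of that bookkeeping (assuming you track s, the total number of halvings, in the main loop; the function name is mine):

#include <cmath>
#include <cstdint>

// Recover r, the number of 3n+1 steps, from s (halvings) and the start/end
// values, using n_end ~= n_start * 3^r / 2^s; total operations = r + s.
uint64_t total_operations(uint64_t n_start, uint64_t n_end, uint64_t s) {
    double r = (s + std::log2((double)n_end / (double)n_start)) / std::log2(3.0);
    return s + (uint64_t)std::llround(r);
}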
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks.
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link). gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place, writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extensions or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs; see the SAR insn set ref manual entry.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
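As a small illustration of why the type matters (a sketch, not code from the question): signed division by 2 has to round toward zero, so the compiler emits extra fixup work for possibly-negative values, while the unsigned version is a single shift.

#include <cstdint>

// Signed: n/2 must round toward zero, so gcc emits roughly "add the sign bit,
// then arithmetic shift right" instead of one plain shift.
int64_t half_signed(int64_t n)     { return n / 2; }

// Unsigned: a single logical shift right, and portable to targets where
// long is 32-bit.
uint64_t half_unsigned(uint64_t n) { return n / 2; }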
unsigned long n
): the inner loop it inlines intomain()
does this:The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.

You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with a 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization
You can and should implement this with intrinsics, instead of hand-written asm.
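No code for this idea is preserved in the text here; a hedged AVX2 intrinsics sketch of the sticky-increment scheme (4x uint64_t per vector, compile with -mavx2; the function name is mine, and it assumes no lane overflows) might be:

#include <immintrin.h>
#include <cstdint>

// Each lane runs its own Collatz sequence; once a lane hits 1 its per-lane
// increment becomes 0, so its count stops growing while other lanes continue.
void collatz4(const uint64_t start[4], uint64_t counts[4]) {
    __m256i n   = _mm256_loadu_si256((const __m256i*)start);
    __m256i cnt = _mm256_setzero_si256();
    const __m256i one = _mm256_set1_epi64x(1);

    for (;;) {
        __m256i done = _mm256_cmpeq_epi64(n, one);          // lanes that reached 1
        if (_mm256_movemask_epi8(done) == -1) break;        // all lanes finished

        __m256i inc = _mm256_andnot_si256(done, one);       // 1 for active lanes, 0 for done
        cnt = _mm256_add_epi64(cnt, inc);                   // sticky: finished lanes add 0

        // n_even = n >> 1;  n_odd = 3n + 1 = n + n + n + 1
        __m256i n_even = _mm256_srli_epi64(n, 1);
        __m256i n_odd  = _mm256_add_epi64(_mm256_add_epi64(n, _mm256_add_epi64(n, n)), one);
        __m256i is_odd = _mm256_sub_epi64(_mm256_setzero_si256(),
                                          _mm256_and_si256(n, one));  // 0 or all-ones per lane
        __m256i next   = _mm256_blendv_epi8(n_even, n_odd, is_odd);

        n = _mm256_blendv_epi8(next, n, done);              // keep finished lanes parked at 1
    }
    _mm256_storeu_si256((__m256i*)counts, cnt);
}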
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
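As an illustration of the memoization idea (a hedged sketch, not code from any of the answers; the cache layout is my own choice):

#include <cstdint>
#include <vector>

// Remember the step count for every start value below the limit, so a sequence
// can stop as soon as it drops onto a value that has already been solved.
uint32_t collatz_steps_memo(uint64_t start, std::vector<uint32_t>& cache) {
    uint64_t n = start;
    uint32_t steps = 0;
    while (n != 1) {
        if (n < cache.size() && cache[n] != 0) break;   // known tail: reuse it
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    uint32_t total = steps + ((n != 1) ? cache[n] : 0);
    if (start < cache.size()) cache[start] = total;
    return total;
}

Called with a std::vector<uint32_t> cache(1000000, 0) and start values in increasing order, most sequences hit a cached value within a few steps.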
@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing, because no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar n's in parallel in different integer registers, though.

So the loop might look like this:
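(The loop code isn't preserved here; a hedged C++ sketch of the idea, using __builtin_ctzll, which is GCC/clang-specific and compiles to TZCNT with BMI1, otherwise BSF:)

#include <cstdint>

// Strip all trailing zero bits in one step instead of one n/=2 per iteration.
uint64_t collatz_steps_ctz(uint64_t n) {
    uint64_t count = 0;
    while (n != 1) {
        if (n % 2 == 0) {
            unsigned shift = __builtin_ctzll(n);   // trailing-zero count; n != 0 here
            n >>= shift;
            count += shift;                        // that many n/=2 steps at once
        } else {
            n = 3 * n + 1;
            ++count;
        }
    }
    return count;
}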
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination, when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is purely a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only software-visible upside is in the interaction with another microarchitectural limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dependency for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.

@Veedrac's code
I got a very small speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium).
As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high-level language; inline assembly is often useful in these cases, and it still lets you use a high-level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.
From comments:
For many numbers the sequence will not overflow.

If it does overflow for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.

Still, this poses an interesting question: is there some overflow-cyclic seed number?

Any simple final converging series starts with a power-of-two value (obvious enough?).

2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only with 1), but the most optimal solution in the answer will finish due to shr rax producing ZF=1.

Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, so the next number is 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically that's an undefined state of the algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheapo" through the undefined 0 number).

How about some more complex number, which will create a cycle without 0? Frankly, I'm not sure; my math theory is too hazy to get any serious idea of how to deal with it in a serious way. But intuitively I would say the series will converge to 1 for every number 0 < number, as the 3n+1 formula will sooner or later turn every non-2 prime factor of the original number (or an intermediate one) into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.

So I just put a few numbers into a sheet and took a look at 8-bit truncated numbers.
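Not the spreadsheet, but a hedged C++ sketch of the same experiment (names and the step limit are mine) could be:

#include <cstdint>
#include <cstdio>

// Count Collatz steps to reach 1, truncating to 'bits' bits after every
// operation (bits == 0 means no truncation). Returns -1 if the truncated
// sequence hits 0 or exceeds the step limit instead of reaching 1.
static int steps(uint64_t n, int bits, int limit = 10000) {
    uint64_t mask = bits ? ((uint64_t(1) << bits) - 1) : ~uint64_t(0);
    int count = 0;
    while (n != 1) {
        if (n == 0 || count > limit) return -1;
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        n &= mask;
        ++count;
    }
    return count;
}

int main() {
    int affected = 0;
    for (uint64_t i = 2; i <= 255; ++i)
        if (steps(i, 8) != steps(i, 0)) ++affected;
    std::printf("starting values 2..255 with a different step count at 8 bits: %d\n", affected);
}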
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).

But there's no value creating a cyclic overflow seed.

Funnily enough I did a check for the first number to suffer from 8-bit truncation, and already 27 is affected! It does reach the value 9232 in the proper non-truncated series (the first truncated value is 322 in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself); the maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to be counted, etc...).

Interestingly enough (for me) the number 9232 is the maximum for many other source numbers; what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.

Unfortunately I can't get any deep grasp of this series - why does it converge, and what are the implications of truncating it to k bits - but with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value ending as 0 after truncation.

But the value 27 overflowing in the 8-bit case is sort of alerting: it looks like if you count the number of steps to reach the value 1, you will get a wrong result for the majority of numbers from the total k-bit set of integers. For the 8-bit integers, 146 numbers out of 256 have their series affected by truncation (some of them may still hit the correct number of steps by accident, maybe; I'm too lazy to check).

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be plain wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However, I believe your profiling methodology has certain flaws. The following are general guidelines for profiling: