I'm investigating a minimal opcode size x86-64 strlen implementation for my code golfing / binary executable that is not supposed to exceed some size (think of demoscene for simplicity).
General idea comes from here, size optimization ideas from here and here.
Input string address is in rdi
, max length should be not bigger than Int32
xor eax,eax ; 2 bytes
or ecx,-1 ; 3 bytes
repne scasb ; 2 bytes
not ecx ; 2 bytes
dec ecx ; 2 bytes
Final result is in ecx
in 11 bytes total.
The question is about setting ecx
to -1
Option 1 already stated
or ecx,-1 ; 3 bytes
Option 2
lea ecx,[rax-1] ; 3 bytes
Option 3
stc ; 1 byte
sbb ecx,ecx ; 2 bytes
Option 4 , probably the slowest one
push -1 ; 2 bytes
pop rcx ; 1 byte
I understand that:
Option 1 has dependency on previous ecx
value
Option 2 has dependency on previous rax
value
Option 3 I'm not sure if it has dependency on previous ecx
value?
Option 4 is the slowest one?
Is there a clear winner here?
The criteria is to keep the opcodes size as small as possible and choose the best one performance wise.
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.
I did some benchmarking on Intel Core i7 4850HQ Haswell 2,3 GHz, release build no debugger attached. In each loop I measure 1000 sequences of asm instructions and repeat it 10 milion times to average result.
I've made macros for repeating asm instructions 100 times.
Testing C code with inline asm for MacOS
Results
205-216 ns
321-355 ns
322-359 ns
612-692 ns
For a hacky good-enough version, we know
rdi
has a valid address. It's very likely thatedi
is not a small integer, thus 2 bytemov ecx, edi
. Check whether this is safe for all call-sites before using!This is great if you just want rdi pointing to the terminating
0
byte, instead of actually needing a count. Or if you have the start pointer in another register so you can dosub edi, edx
or something and get the length that way instead of processing thercx
result. (If you know the result fits in 32 bits, you don't needsub rdi, rdx
because you know the upper bits of that would be zero anyway. And high input bits don't affect low output bits for add/sub; carry propagates left to right.)For strings known to be under 255 bytes, you can use
mov cl, -1
(2 bytes). That makesrcx
at least 0xFF, and higher depending on what high garbage was left in it. (This has a partial-reg stall on Nehalem and earlier when RCX is read, otherwise just a dependency on the old RCX). Anyway, thenmov al, -2
/sub al, cl
to get the length as an 8-bit integer. This may or may not be useful.Depending on the caller,
rcx
might have already been holding a pointer value, in which case you could leave it untouched if you can use pointer subtraction.Out of the options you proposed
lea ecx,[rax-1]
is very good because you just xor-zeroedeax
, and it's a cheap 1 uop instruction with 1 cycle latency and can run on multiple execution ports on all mainstream CPUs.When you already have another register with a known constant value, especially one that's xor-zeroed, 3-byte
lea
almost always the most efficient 3-byte way to create a constant, if it works. (See Set all bits in CPU register to 1 efficiently).Yes,
repne scasb
is very compact. Its startup overhead is maybe something like 15 cycles on a typical Intel CPU, and according to Agner Fog, it issues >=6n uops with a throughput of >= 2n cycles, wheren
is the count (i.e. 2 cycles per byte it compares for long compares, where the startup overhead is hidden), so it dwarfs the cost oflea
.Something with a false dependency on
ecx
could delay its startup, so you definitely wantlea
.repne scasb
is probably fast enough for whatever you're doing, but it's slower thanpcmpeqb
/pmovmsbk
/cmp
. For short fixed-length strings, integercmp
/jne
is very good when the length is 4 or 8 bytes (including the terminating 0), assuming you can safely over-read your strings, i.e. you don't have to worry about""
at the end of a page. This method has overhead that scales with string length, though. For string length=7, for example, you could do 4, 2, and 1 operand sizes, or you could do two dword compares overlapping by 1 byte. likecmp dword [rdi], first_4_bytes / jne
;cmp dword [rdi+3], last_4_bytes / jne
.More details about LEA
On a Sandybridge-family CPU, the
lea
could be dispatched to an execution unit in the same cycle as it and thexor
-zero were issued into the out-of-order CPU core.xor
-zeroing is handled in the issue/rename stage, so the uop enters the ROB in an "already-executed" state. It's impossible for the instruction to ever have to wait for RAX. (Unless an interrupt happens between the xor and thelea
, but even then I think there'd be a serializing instruction after restoring RAX and before thelea
could execute, so it couldn't be stuck waiting.)Simple
lea
can run on port0 or port1 on SnB, or port1 / port5 on Skylake (2 per clock throughput, but sometimes different ports on different SnB-family CPUs). It's 1 cycle latency, so it's hard to do much better.It's unlikely you'd see any speedup from using
mov ecx, -1
(5 bytes) which can run on any ALU port.On an AMD Ryzen,
lea r32, [m]
in 64-bit mode is treated as a "slow" LEA that can only run on 2 ports, and has 2c latency instead of 1. Worse, Ryzen doesn't eliminate xor-zeroing.The microbenchmark test you did only measures throughput for the versions with no false dependencies, not latency. This is often a useful measure, and you did happen to get the right answer that
lea
is the best choice.Whether pure throughput accurately reflects anything about your real use case is another matter. You might actually depend on the latency, not throughput, if the string compare is on the critical path as part of a long or loop-carried data dependency chain not broken by a
jcc
to give you branch-prediction + speculative execution. (But branchless code is often larger, so this is unlikely).stc
/sbb ecx,ecx
is interesting, but only AMD CPUs treatsbb
as dependency-breaking (only depending on CF, not the integer register). On Intel Haswell and earlier,sbb
is a 2 uop instruction (because it has 3 inputs: 2 GP integer + flags). It has 2c latency, which is why it performs so badly. (The latency is a loop-carried dep chain.)Shortening other parts of the sequence
Depending what you're doing, you may be able to use
strlen+2
just as well, but offsetting another constant or something.dec ecx
is only 1 byte in 32-bit code, but x86-64 doesn't have the short-forminc/dec
instructions. So not / dec isn't as cool in 64-bit code.After
repne scas
, you haveecx = -len - 2
(if you started withecx = -1), and
notgives you
-x-1(i.e.
+len + 2 - 1`).