For speed, I would like to read one of 8 registers referenced by the value in a 9th register.
The fastest way I see to do this is to use 3 conditional jumps (checking 3 bits in the 9th
register). This should have shorter latency than the standard way of doing this with an offset
memory read, but this still requires at least 6 clock cycles (at least one test plus one
conditional jmp per bit check).
Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register
read" with a latency of just one clock cycle?
In theory, an optimized CPU could do this with one addition and one move, so two or one clock
cycles seems easy...is there some general reason that architectures don't care about speeding
up an offset read for a small array?
Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array?
That said, you could use a computed jump. This also replaces a data dependency (indexed addressing mode) with a control dependency so out-of-order exec doesn't have to wait for the index input to even be ready before it can start running code that uses the final RAX. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict costs many cycles of little work being done, but the small latency of a load that hits in L1d cache can overlap with independent work very easily.
The throughput cost is higher than an array in memory: some address computations, one jump, one move and a ret
, instead of just a mov
or even a memory operand with an indexed addressing mode.
To inline this code, simply replace the jmp *%rax
with a call *%rax
, costing another uop. Or replace the ret
instructions with a jmp
to a label at the bottom and increase the stride of the jump table to 8 to account for the longer encoding.
# select a register from r8...r15 according to the value in rdi
select:
lea labels-4*8(%rip),%rax # rdi = 8 is the first jump table entry
lea (%rax,%rdi,4),%rax # pointer to the appropriate entry
jmp *%rax # computed jump
.align 4
labels:
mov %r8, %rax
ret
.align 4
mov %r9, %rax
ret
.align 4
mov %r10, %rax
ret
.align 4
mov %r11, %rax
ret
.align 4
mov %r12, %rax
ret
.align 4
mov %r13, %rax
ret
.align 4
mov %r14, %rax
ret
.align 4
mov %r15, %rax
ret
While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array.