-->

Fastest Offset Read for a Small Array

2020-07-25 10:37发布

问题:

For speed, I would like to read one of 8 registers referenced by the value in a 9th register. The fastest way I see to do this is to use 3 conditional jumps (checking 3 bits in the 9th register). This should have shorter latency than the standard way of doing this with an offset memory read, but this still requires at least 6 clock cycles (at least one test plus one conditional jmp per bit check).

Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register read" with a latency of just one clock cycle?

In theory, an optimized CPU could do this with one addition and one move, so two or one clock cycles seems easy...is there some general reason that architectures don't care about speeding up an offset read for a small array?

回答1:

Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array?

That said, you could use a computed jump. This also replaces a data dependency (indexed addressing mode) with a control dependency so out-of-order exec doesn't have to wait for the index input to even be ready before it can start running code that uses the final RAX. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict costs many cycles of little work being done, but the small latency of a load that hits in L1d cache can overlap with independent work very easily.

The throughput cost is higher than an array in memory: some address computations, one jump, one move and a ret, instead of just a mov or even a memory operand with an indexed addressing mode.

To inline this code, simply replace the jmp *%rax with a call *%rax, costing another uop. Or replace the ret instructions with a jmp to a label at the bottom and increase the stride of the jump table to 8 to account for the longer encoding.

    # select a register from r8...r15 according to the value in rdi
select:
    lea labels-4*8(%rip),%rax # rdi = 8 is the first jump table entry
    lea (%rax,%rdi,4),%rax    # pointer to the appropriate entry
    jmp *%rax                 # computed jump

    .align 4
labels:
    mov %r8, %rax
    ret

    .align 4
    mov %r9, %rax
    ret

    .align 4
    mov %r10, %rax
    ret

    .align 4
    mov %r11, %rax
    ret

    .align 4
    mov %r12, %rax
    ret

    .align 4
    mov %r13, %rax
    ret

    .align 4
    mov %r14, %rax
    ret

    .align 4
    mov %r15, %rax
    ret

While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array.