How to optimise double dereferencing?

Posted 2019-07-21 18:25

Question:

Very specific optimisation task. I have 3 arrays:

const char* inputTape
const int* inputOffset, organised in a group of four
char* outputTape

I must assemble the output tape from the input tape, using the following five operations for each index i:

int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte

and then advance the counter i.
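For reference, the five operations can be wrapped in a plain scalar loop. This is only a minimal sketch: the function name assemble_tape and the count parameter are made up for illustration, and the assert encodes my assumption that the selector byte is always 0, 1 or 2, so the second lookup stays inside the same group of four.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference loop. `assemble_tape` and `count` are hypothetical
   names, not part of the original code. */
void assemble_tape(const char *inputTape, const int *inputOffset,
                   char *outputTape, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        /* First dereference: where to read the selector byte. */
        int selectorOffset = inputOffset[4 * i];
        unsigned char selectorValue = (unsigned char)inputTape[selectorOffset];

        /* Assumption: the selector picks one of the three remaining
           entries of the group, so it must be 0, 1 or 2. */
        assert(selectorValue <= 2);

        /* Second dereference: where to read the output byte. */
        int outputOffset = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i] = inputTape[outputOffset]; /* store byte */
    }
}
```

Each iteration reads only inputTape and its own group in inputOffset and writes only outputTape[i], so the iterations do not depend on each other.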

All iterations are identical and independent, so they could all be done in parallel. The format of inputOffset can be changed, as long as the same input still produces the same output.

OpenCL on a GPU does not help with this algorithm (it runs at the same speed as the CPU, or even slower).

In assembly, the best I have got is 5 mov, 1 lea and 1 dec instructions per iteration. Upd: thanks to Peter Cordes for the little hint.

loop_start:
mov         eax,dword ptr [rdx-10h]             ; selector offset
movzx       r10d,byte ptr [rax+r8]          ; selector value
mov         eax,dword ptr [rdx+r10*4-0Ch]       ; output offset
movzx       r10d,byte ptr [r8+rax]          ; output value
mov         byte ptr [r9+rcx-1],r10b            ; store to outputTape
lea         rdx, [rdx-10h]                  ; pointer to inputOffset for current 
dec         ecx                             ; loop counter, sets zero flag if (ecx == 0)
jne         loop_start                      ; continue looping while non zero iterations left: ( ecx != 0 )
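For readers who prefer C, the loop above corresponds roughly to the sketch below. The register mapping is my reading of the comments, not something stated explicitly: r8 = inputTape, r9 = outputTape, rdx = pointer just past the current group in inputOffset, ecx = remaining iterations. The name assemble_tape_backward is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C rendering of the assembly loop: walks the groups from last to first.
   `assemble_tape_backward` is a hypothetical name. */
void assemble_tape_backward(const char *inputTape, const int *inputOffset,
                            char *outputTape, unsigned count)
{
    /* Like the assembly (dec/jne at the bottom), the loop body runs
       before the counter is tested, so count must be non-zero. */
    assert(count > 0);

    const int *p = inputOffset + 4 * (size_t)count; /* rdx: one past the last group */
    do {
        int selectorOffset = p[-4];                           /* [rdx-10h] */
        unsigned char sel =
            (unsigned char)inputTape[selectorOffset];         /* movzx, zero-extended */
        int outputOffset = p[(int)sel - 3];                   /* [rdx+r10*4-0Ch] */
        outputTape[count - 1] = inputTape[outputOffset];      /* [r9+rcx-1] */
        p -= 4;                                               /* lea rdx,[rdx-10h] */
    } while (--count != 0);                                   /* dec ecx / jne */
}
```

One difference to keep in mind: the assembly loads the offsets with a 32-bit mov, which zero-extends them, while this sketch uses them as signed int. For non-negative offsets the two behave the same.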

How could I optimise this using SSE/AVX operations? I am stumped...

UPD: better to see it than to hear it..