I have a very specific optimisation task. I have three arrays:
- const char* inputTape
- const int* inputOffset, organised in groups of four
- char* outputTape
I must assemble the output tape from the input tape, according to the following five operations:
int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte
and then advance the counter i.
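For reference, the five operations above can be wrapped into a plain scalar loop. This is only a minimal sketch: the function name assemble_tape and the count parameter are made up for illustration, and it assumes every selector byte holds 0, 1 or 2 so that the index 4*i+1+selectorValue stays inside group i.

```c
#include <stddef.h>

/* Scalar reference for the five operations above.
   assemble_tape and count are hypothetical names, not part of the original code.
   Assumption: each selector byte is 0, 1 or 2. */
void assemble_tape(const char *inputTape, const int *inputOffset,
                   char *outputTape, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        int selectorOffset = inputOffset[4 * i];
        /* Read the selector as unsigned so it is zero-extended, like movzx. */
        unsigned char selectorValue = (unsigned char)inputTape[selectorOffset];
        int outputOffset = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i] = inputTape[outputOffset]; /* store byte */
    }
}
```

A loop like this is handy as a baseline to check any vectorised version against, byte for byte.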
All iterations are independent of each other and could all be done in parallel. The format of inputOffset can be changed, as long as the same input still produces the same output.
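As an illustration of what a changed inputOffset format could look like, here is a sketch of one possible re-layout: the interleaved groups of four are split into four contiguous planes (selector offsets first, then the three candidate output offsets). The names split_offsets, assemble_from_planes, planes and count are hypothetical, the selector bytes are again assumed to be 0, 1 or 2, and I have not measured whether this layout is actually faster.

```c
#include <stddef.h>

/* Hypothetical planar layout: planes[0*count + i] is the selector offset,
   planes[(1+k)*count + i] is candidate output offset k for element i. */
void split_offsets(const int *inputOffset, size_t count, int *planes)
{
    for (size_t i = 0; i < count; ++i)
        for (size_t k = 0; k < 4; ++k)
            planes[k * count + i] = inputOffset[4 * i + k];
}

/* Same result as the original five operations, reading the planar layout.
   Assumption: each selector byte is 0, 1 or 2. */
void assemble_from_planes(const char *inputTape, const int *planes,
                          char *outputTape, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        unsigned char selectorValue = (unsigned char)inputTape[planes[i]];
        int outputOffset = planes[(1 + (size_t)selectorValue) * count + i];
        outputTape[i] = inputTape[outputOffset];
    }
}
```

With this layout, consecutive selector offsets sit next to each other in memory, so a block of them could be fetched with a single contiguous vector load; the dependent byte loads from inputTape remain scattered either way.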
OpenCL on a GPU does not help with this algorithm (it runs at the same speed as the CPU, or even slower).
In assembly, the best I have got is 5 mov, 1 lea and 1 dec instructions per iteration (plus the loop branch). Update: thanks to Peter Cordes for the little hint.
loop_start:
mov eax,dword ptr [rdx-10h] ; selector offset
movzx r10d,byte ptr [rax+r8] ; selector value
mov eax,dword ptr [rdx+r10*4-0Ch] ; output offset
movzx r10d,byte ptr [r8+rax] ; output value
mov byte ptr [r9+rcx-1],r10b ; store to outputTape
lea rdx, [rdx-10h] ; move the inputOffset pointer back one group (4 ints = 16 bytes)
dec ecx ; loop counter, sets zero flag if (ecx == 0)
jne loop_start ; continue looping while non zero iterations left: ( ecx != 0 )
How could I optimise this using SSE/AVX? I am stumped...