I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?
void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
numPix /= 8; //process 8 pixels at a time
uint8x8_t alpha = vdup_n_u8 (0xff);
for (int i=0; i<numPix; i++)
{
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8x4_t bgra;
bgra.val[0] = rgb.val[2]; //these lines are slow
bgra.val[1] = rgb.val[1]; //these lines are slow
bgra.val[2] = rgb.val[0]; //these lines are slow
bgra.val[3] = alpha;
vst4_u8(dst, bgra);
src += 8*3;
dst += 8*4;
}
}
The ARMCC disassembly isn't that fast either :
It isn't using the most appropriate instructions
It mixes VFP instructions with NEON ones which causes huge hiccups every time
Try this :
My suggested code isn't fully optimized either, but further "real" optimizations would harm the readability seriously. So I stop here.
This depends on the compiler. For example when I compile the code above with armcc (5.01) and disassemble it, what I get looks like (I'm just putting the loop and I moved alpha assignment outside of the loop)
If I compile the code with gcc (4.4.3) and disassemble again I get,
And the execution time was 10 times faster with armcc.
If I compile armcc produced assembly code for the function (it looks like now alpha is back in loop :)) with gcc (inline assembly)
I get the same execution time with gcc as well.
At the end what I suggest you is either disassemble your binary and check if the compiler produces what you want or use assembly.
Btw if you want to improve the execution time of this function even further, I suggest you to look into