I'm trying to wrap my head around NEON intrinsics, and figured I could start with an example and ask some questions.
In this experiment I want to convert 32-bit RGB to 16-bit BGR. What would be a good start in converting the following code to use NEON intrinsics? The problem I'm having here is that a single 16-bit value doesn't match any intrinsic type that I can see. There's 16x4, 16x8, etc., but I'm having little luck wrapping my thoughts around how I need to approach this. Any tips?
Here's the code I'm trying to convert.
typedef struct {
    uint16_t b:5, g:6, r:5;
} _color16;

static int depth_transform_32_to_16_c (VisVideo *dest, VisVideo *src)
{
    int x, y;
    int w;
    int h;

    _color16 *dbuf = visual_video_get_pixels (dest);
    uint8_t *sbuf = visual_video_get_pixels (src);

    int ddiff;
    int sdiff;

    depth_transform_get_smallest (dest, src, &w, &h);

    ddiff = (dest->pitch / dest->bpp) - w;
    sdiff = src->pitch - (w * src->bpp);

    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            dbuf->b = *(sbuf++) >> 3;
            dbuf->g = *(sbuf++) >> 2;
            dbuf->r = *(sbuf++) >> 3;

            dbuf++;
            sbuf++;
        }

        dbuf += ddiff;
        sbuf += sdiff;
    }

    return VISUAL_OK;
}
Edit: oh, for some reason I was looking at this considering 16x3 bits, but we're looking at 5,6,5 = 16bits. I realize I need shifts. Hmm.
NEON uses 128-bit-wide registers, so conceptually what you want to do is read in four pixels of 32-bit RGB, use bitwise operations on those, and eventually write out your 16-bit pixels. One observation is that for best performance you may want to combine two 128-bit inputs (eight 32-bit pixels) and produce one 128-bit output. This will make your memory accesses more efficient.
Another way to think about this is that you are taking your inner loop content and doing four pixels in parallel. The reason it's a little hard to work with your original code is that some of the "magic" is hidden because you're using bit fields. If you rewrote your C code to work from 32 bits to 16 bits using shifts, ANDs, and ORs, that code would translate to SIMD more naturally, and you could visualize how you'd work with multiple data within that context.
If you just look at the 32-bit -> 16-bit transformation of each component:
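Here's a minimal scalar sketch of that per-component view, not a drop-in replacement for your function. It assumes a little-endian 32-bit load (so the blue byte your loop reads first ends up in bits 0-7) and that your compiler puts the `b` bit field in the low bits of `_color16`, which is typical for GCC on little-endian ARM but is implementation-defined. The name `pack_pixel_32_to_16` is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: pack one 32-bit pixel into 5-6-5.
   Assumed input layout after a little-endian load:
   blue in bits 0-7, green in bits 8-15, red in bits 16-23,
   unused byte in bits 24-31. */
static inline uint16_t pack_pixel_32_to_16(uint32_t p)
{
    uint32_t b = (p >> 3) & 0x001F;  /* top 5 bits of blue  -> bits 0-4   */
    uint32_t g = (p >> 5) & 0x07E0;  /* top 6 bits of green -> bits 5-10  */
    uint32_t r = (p >> 8) & 0xF800;  /* top 5 bits of red   -> bits 11-15 */

    return (uint16_t)(r | g | b);    /* combine; the unused byte is masked off */
}
```

Each component is one shift and one mask, and the final OR combines them, so nothing here depends on bit fields any more.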
This can help you visualize what you need to do in parallel for four pixels: shift, extract, and combine. You can think about this as four 32-bit lanes, though for some of the bit operations the lane width doesn't matter (e.g. OR-ing four 32-bit lanes or eight 16-bit lanes gives the same result).
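Putting those pieces together, here's a rough sketch of how it could look with intrinsics. Treat it as a starting point under stated assumptions rather than a tuned implementation: it assumes little-endian 32-bit loads, the B, G, R, unused byte order from your loop, and `b` in the low bits of the 16-bit result. The function name `convert_row_32_to_16` is made up. It reads two 128-bit registers (8 pixels), does the shift/mask/OR in 32-bit lanes, narrows each half to 16 bits with `vmovn_u32`, and writes one 128-bit result. The scalar loop handles leftover pixels and also serves as the fallback when NEON isn't available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: convert n pixels of a single row. */
static void convert_row_32_to_16(uint16_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint32x4_t mask_b = vdupq_n_u32(0x001F);
    const uint32x4_t mask_g = vdupq_n_u32(0x07E0);
    const uint32x4_t mask_r = vdupq_n_u32(0xF800);

    for (; i + 8 <= n; i += 8) {
        /* Two 128-bit loads: 4 pixels each, 8 pixels in total. */
        uint32x4_t p0 = vld1q_u32(src + i);
        uint32x4_t p1 = vld1q_u32(src + i + 4);

        /* Shift each component into place, mask it, then OR them together. */
        uint32x4_t o0 = vorrq_u32(
            vorrq_u32(vandq_u32(vshrq_n_u32(p0, 8), mask_r),
                      vandq_u32(vshrq_n_u32(p0, 5), mask_g)),
            vandq_u32(vshrq_n_u32(p0, 3), mask_b));
        uint32x4_t o1 = vorrq_u32(
            vorrq_u32(vandq_u32(vshrq_n_u32(p1, 8), mask_r),
                      vandq_u32(vshrq_n_u32(p1, 5), mask_g)),
            vandq_u32(vshrq_n_u32(p1, 3), mask_b));

        /* Narrow 32-bit lanes to 16 bits and store 8 pixels at once. */
        vst1q_u16(dst + i, vcombine_u16(vmovn_u32(o0), vmovn_u32(o1)));
    }
#endif

    /* Scalar tail (and fallback): same shift/mask/OR, one pixel at a time. */
    for (; i < n; i++) {
        uint32_t p = src[i];
        dst[i] = (uint16_t)(((p >> 8) & 0xF800) |
                            ((p >> 5) & 0x07E0) |
                            ((p >> 3) & 0x001F));
    }
}
```

You would call this once per row with `n = w` and keep your existing pitch handling between rows.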
Rough pseudo-code: