Depth transformation with ARM NEON intrinsics

I'm trying to wrap my head around NEON intrinsics, and figured I could start with an example and ask some questions.

In this experiment I want to convert 32-bit RGB to 16-bit BGR. What would be a good start for converting the following code to use NEON intrinsics? The problem I'm having is that 16-bit doesn't match any intrinsic type that I can see. There are 16x4, 16x8, etc., but I'm having little luck working out how I need to approach this. Any tips?

Here's the code I'm trying to convert.

typedef struct {
    uint16_t b:5, g:6, r:5;
} _color16;

static int depth_transform_32_to_16_c (VisVideo *dest, VisVideo *src)
{
    int x, y;
    int w;
    int h;

    _color16 *dbuf = visual_video_get_pixels (dest);
    uint8_t *sbuf = visual_video_get_pixels (src);

    int ddiff;
    int sdiff;

    depth_transform_get_smallest (dest, src, &w, &h);

    ddiff = (dest->pitch / dest->bpp) - w;
    sdiff = src->pitch - (w * src->bpp);

    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            dbuf->b = *(sbuf++) >> 3;
            dbuf->g = *(sbuf++) >> 2;
            dbuf->r = *(sbuf++) >> 3;

            dbuf++;
            sbuf++;
        }

        dbuf += ddiff;
        sbuf += sdiff;
    }

    return VISUAL_OK;
}

Edit: oh, for some reason I was looking at this considering 16x3 bits, but we're looking at 5,6,5 = 16bits. I realize I need shifts. Hmm.

NEON's quad-word (Q) registers are 128 bits wide, so conceptually you want to read in four 32-bit RGB pixels, apply bitwise operations to them, and eventually write out your 16-bit pixels. For best performance you may want to combine two 128-bit inputs (eight 32-bit pixels) into one 128-bit output. This will make your memory accesses more efficient, because every store then writes a full register.

Another way to think about this is that you are taking the body of your inner loop and running it on four pixels in parallel. Your original code is a little hard to work from because some of the "magic" is hidden by the bit-fields. If you rewrote your C code to go from a 32-bit value to a 16-bit value using shifts, ANDs, and ORs, it would translate to SIMD more naturally, and you could visualize how to work on multiple pixels at once.
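As a rough sketch of what that rewrite could look like (the names `pack_565` and `pack_565_row` are made up for illustration and are not from your code base): it assumes each source pixel is read as a little-endian 32-bit word `0x00RRGGBB`, which matches the byte order your loop consumes, and that the compiler places the first bit-field (`b`) in the low bits of the `uint16_t`, as GCC does on little-endian ARM. Bit-field layout is implementation-defined, so check your target and adjust the red and blue shifts and masks if they end up the other way round.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one 0x00RRGGBB pixel into a 5-6-5 word.
 * Assumption: blue lands in bits 0-4, green in bits 5-10, red in bits 11-15. */
static uint16_t pack_565(uint32_t px)
{
    uint32_t r = (px >> 8) & 0xF800u;  /* red bits 19-23   -> bits 11-15 */
    uint32_t g = (px >> 5) & 0x07E0u;  /* green bits 10-15 -> bits 5-10  */
    uint32_t b = (px >> 3) & 0x001Fu;  /* blue bits 3-7    -> bits 0-4   */
    return (uint16_t)(r | g | b);
}

/* Hypothetical helper: convert one row of `count` pixels. */
static void pack_565_row(uint16_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = pack_565(src[i]);
}
```

Each channel is now one shift plus one AND, and the three results are OR-ed together, which is the shape that maps directly onto vector instructions.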

If you look at how a single 32-bit pixel is transformed into a 16-bit one:

00000000RRRRRRRRGGGGGGGGBBBBBBBB
0000000000000000BBBBBGGGGGGRRRRR

This can help you visualize what you need to do in parallel for four pixels: shift, extract, and combine. You can think of the register as four 32-bit lanes, although for some of the bit operations the lane width doesn't matter (e.g. OR-ing a register viewed as four 32-bit lanes gives the same result as OR-ing it viewed as eight 16-bit lanes).

Rough pseudo-code:

Read (vector load) one 128-bit register = four 32-bit pixels.
Shift green (in all four pixels) into its final bit position.
Mask out green (with an AND mask) into another register. (Conceptually still in 4x32-bit "mode".)
Shift red (in all four pixels) into its final bit position.
Mask out red into yet another register.
Shift blue into its final bit position.
Mask out blue into another register.
Use bitwise OR to combine the three registers.
Now you have four 16-bit values, each aligned to 32 bits. (Everything so far is still conceptually 4x32-bit.)
Repeat with another set of four pixels.
Combine the two sets with a NEON unzip (VUZP) to produce one 128-bit register holding eight pixels.
Write (vector store) those pixels.
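The steps above could be sketched with intrinsics roughly as follows. Treat it as a starting point under stated assumptions rather than a drop-in replacement: the function names are hypothetical, source pixels are assumed to be 4-byte-aligned little-endian `0x00RRGGBB` words, and the output is assumed to carry red in the top five bits (adjust the shifts and masks if your 16-bit layout differs). The NEON path is only compiled when the compiler defines `__ARM_NEON`; otherwise the scalar loop does all the work. Pitch handling is left out, so you would call it once per row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference; also handles the tail and non-NEON builds. */
static uint16_t pack_565(uint32_t px)
{
    return (uint16_t)(((px >> 8) & 0xF800u) |
                      ((px >> 5) & 0x07E0u) |
                      ((px >> 3) & 0x001Fu));
}

#if HAVE_NEON
/* Shift, mask and OR four pixels at once. Each result ends up in the
 * low 16 bits of its 32-bit lane; the high 16 bits are zero. */
static uint32x4_t pack_565_x4(uint32x4_t px)
{
    uint32x4_t r = vandq_u32(vshrq_n_u32(px, 8), vdupq_n_u32(0xF800u));
    uint32x4_t g = vandq_u32(vshrq_n_u32(px, 5), vdupq_n_u32(0x07E0u));
    uint32x4_t b = vandq_u32(vshrq_n_u32(px, 3), vdupq_n_u32(0x001Fu));
    return vorrq_u32(vorrq_u32(r, g), b);
}
#endif

/* Hypothetical helper: convert one row of `count` pixels. */
static void convert_row_565(uint16_t *dst, const uint32_t *src, size_t count)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL);
#if HAVE_NEON
    for (; i + 8 <= count; i += 8) {
        uint32x4_t lo = pack_565_x4(vld1q_u32(src + i));      /* pixels 0-3 */
        uint32x4_t hi = pack_565_x4(vld1q_u32(src + i + 4));  /* pixels 4-7 */
        /* Unzip on 16-bit lanes: val[0] gathers the even lanes, which
         * are the low halves that hold the packed pixels. */
        uint16x8x2_t z = vuzpq_u16(vreinterpretq_u16_u32(lo),
                                   vreinterpretq_u16_u32(hi));
        vst1q_u16(dst + i, z.val[0]);
    }
#endif
    /* Remaining pixels (or all of them without NEON). */
    for (; i < count; i++)
        dst[i] = pack_565(src[i]);
}
```

Narrowing each half with `vmovn_u32` and joining the halves with `vcombine_u16` would be another way to do the final combine; which one is faster is something to measure on your target.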