How to load a pixel struct into an SSE register?

I have a struct of 8-bit pixel data:

struct __attribute__((aligned(4))) pixels {
    char r;
    char g;
    char b;
    char a;
};

I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers?

Unpacking unsigned pixels with SSE2

OK, using the SSE2 integer intrinsics from <emmintrin.h>, first load the whole pixel into the lower 32 bits of the register:

__m128i xmm0 = _mm_cvtsi32_si128(*(const int*)&pixel);

Then unpack those 8-bit values into 16-bit values in the lower 64 bits of the register, interleaving them with 0s:

xmm0 = _mm_unpacklo_epi8(xmm0, _mm_setzero_si128());

And again unpack those 16-bit values into 32-bit values:

xmm0 = _mm_unpacklo_epi16(xmm0, _mm_setzero_si128());

You should now have each pixel component as a 32-bit integer in its own one of the four 32-bit lanes of the SSE register.
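
Put together, the three steps above might look like the following minimal sketch. The struct name pixel_u8 and the function name load_pixel_u32 are made up for this example, the components are assumed to be unsigned [0,255] values, and memcpy is used in place of the pointer cast only to sidestep strict-aliasing concerns:

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical pixel type for this sketch; unsigned char makes the
   [0,255] range explicit. */
struct pixel_u8 {
    unsigned char r, g, b, a;
};

/* Zero-extends the four 8-bit components into four 32-bit lanes
   (lane 0 = r, lane 1 = g, lane 2 = b, lane 3 = a). */
static __m128i load_pixel_u32(const struct pixel_u8 *p)
{
    int32_t bits;
    memcpy(&bits, p, sizeof bits);            /* copy the 4 bytes without an aliasing cast */

    __m128i v = _mm_cvtsi32_si128(bits);      /* pixel in the lower 32 bits, rest zero */
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);           /* 8-bit -> 16-bit, interleaved with 0s */
    v = _mm_unpacklo_epi16(v, zero);          /* 16-bit -> 32-bit, interleaved with 0s */
    return v;
}
```

When converting many pixels in a loop, the zero register would normally be set up once outside the loop.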

Unpacking signed pixels with SSE2

I just read that you want to get those values as 32-bit signed integers, though I wonder what sense a signed pixel in [-128,127] makes. But if your pixel values can indeed be negative, the interleaving with zeros won't work, since it turns a negative 8-bit number into a positive 16-bit number (thus interpreting your numbers as unsigned pixel values). A negative number has to be extended with 1s instead of 0s, but unfortunately that would have to be decided dynamically on a component-by-component basis, which SSE is not that good at.

What you could do is compare the values against zero to test for negativity and use the resulting mask (which fortunately uses 1...1 for true and 0...0 for false) as the second interleaving operand, instead of the zero register:

xmm0 = _mm_unpacklo_epi8(xmm0, _mm_cmplt_epi8(xmm0, _mm_setzero_si128()));
xmm0 = _mm_unpacklo_epi16(xmm0, _mm_cmplt_epi16(xmm0, _mm_setzero_si128()));

This will properly extend negative numbers with 1s and positive numbers with 0s. But of course this additional overhead (in the form of probably 2-4 additional SSE instructions) is only necessary if your initial 8-bit pixel values can ever be negative, which I still doubt. If this really is the case, you should consider signed char rather than char, as the latter has implementation-defined signedness (in the same way, you should use unsigned char if those are the common unsigned [0,255] pixel values).
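
As a self-contained sketch of the comparison-based sign extension (the function name widen_s8_to_s32_cmp is just an illustrative choice, and the input is assumed to be four signed char components):

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sign-extends 4 packed signed bytes into 4 x int32. */
static __m128i widen_s8_to_s32_cmp(const signed char src[4])
{
    int32_t bits;
    memcpy(&bits, src, sizeof bits);             /* the 4 components as one 32-bit value */

    __m128i v = _mm_cvtsi32_si128(bits);
    const __m128i zero = _mm_setzero_si128();

    __m128i sign8 = _mm_cmplt_epi8(v, zero);     /* 0xFF where the byte is negative */
    v = _mm_unpacklo_epi8(v, sign8);             /* 8-bit -> 16-bit, sign-extended */

    __m128i sign16 = _mm_cmplt_epi16(v, zero);   /* 0xFFFF where the word is negative */
    v = _mm_unpacklo_epi16(v, sign16);           /* 16-bit -> 32-bit, sign-extended */
    return v;
}
```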

Alternative SSE2 unpacking using shifts

As clarified, you don't need the signed 8-bit to 32-bit conversion, but for the sake of completeness: harold had another very good idea for the SSE2-based sign extension, as an alternative to the comparison-based version above. We first unpack the 8-bit values into the upper byte of the 32-bit values instead of the lower byte. Since we don't care about the lower bytes, we just use the 8-bit values again as filler, which frees us from the need for an extra zero register and an additional move:

xmm0 = _mm_unpacklo_epi8(xmm0, xmm0);
xmm0 = _mm_unpacklo_epi16(xmm0, xmm0);

Now we just need to perform an arithmetic right shift of the upper byte into the lower byte, which does the proper sign extension for negative values:

xmm0 = _mm_srai_epi32(xmm0, 24);

This should be more efficient in both instruction count and register use than my comparison-based SSE2 version above.

And since it should match the zero-extension above in instruction count for a single pixel (though it costs 1 more instruction when amortized over many pixels, where the zero register is set up only once) and is more register efficient (no extra zero register), it might even be used for the unsigned 8-bit to 32-bit conversion if registers are scarce, but then with a logical shift (_mm_srli_epi32) instead of an arithmetic shift.
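
A small sketch of both shift-based variants side by side; the function names are made up for this example, and both expect the four packed bytes in the lower 32 bits of the input register:

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: sign-extends the 4 low bytes of v into 4 x int32. */
static __m128i widen_s8_to_s32_shift(__m128i v)
{
    v = _mm_unpacklo_epi8(v, v);     /* each byte b becomes the 16-bit word bb */
    v = _mm_unpacklo_epi16(v, v);    /* each word becomes the 32-bit value bbbb */
    return _mm_srai_epi32(v, 24);    /* arithmetic shift: top byte moves down, sign kept */
}

/* Hypothetical helper: zero-extends the 4 low bytes of v into 4 x uint32. */
static __m128i widen_u8_to_u32_shift(__m128i v)
{
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi16(v, v);
    return _mm_srli_epi32(v, 24);    /* logical shift: upper bits filled with 0s */
}
```

The only difference between the two is the final shift, so the choice between signed and unsigned interpretation is made in a single instruction.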

Improved unpacking with SSE4

Thanks to harold's comment, there is an even better option for the initial 8-to-32 conversion. If you have SSE4 support (SSE4.1 to be precise), there are instructions (with intrinsics declared in <smmintrin.h>) that do the complete conversion from 4 packed 8-bit values in the lower 32 bits of the register into 4 32-bit values in the whole register, both for signed and unsigned 8-bit values:

xmm0 = _mm_cvtepu8_epi32(xmm0);   //or _mm_cvtepi8_epi32 for signed 8-bit values
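
A minimal sketch of the SSE4.1 route. The function names are hypothetical, and the target attribute is a GCC/Clang-specific assumption that lets these two functions use SSE4.1 without building the whole file with -msse4.1; the CPU running the code still has to support SSE4.1:

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics (pulls in the SSE2 ones as well) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero-extends 4 bytes at src into out[0..3]. */
__attribute__((target("sse4.1")))
static void widen_u8_sse41(const void *src, int32_t out[4])
{
    int32_t bits;
    memcpy(&bits, src, sizeof bits);
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(bits));  /* 4 x u8 -> 4 x 32-bit */
    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: sign-extends 4 bytes at src into out[0..3]. */
__attribute__((target("sse4.1")))
static void widen_s8_sse41(const void *src, int32_t out[4])
{
    int32_t bits;
    memcpy(&bits, src, sizeof bits);
    __m128i v = _mm_cvtepi8_epi32(_mm_cvtsi32_si128(bits));  /* 4 x s8 -> 4 x 32-bit */
    _mm_storeu_si128((__m128i *)out, v);
}
```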

Packing pixels with SSE2

As for the follow-up of reversing this transformation, first we pack the signed 32-bit integers into signed 16-bit integers, with saturation:

xmm0 = _mm_packs_epi32(xmm0, xmm0);

Then we pack those 16-bit values into unsigned 8-bit values using saturation:

xmm0 = _mm_packus_epi16(xmm0, xmm0);

We can then finally take our pixel from the lower 32 bits of the register:

*(int*)&pixel = _mm_cvtsi128_si32(xmm0);

Due to the saturation, this whole process will automatically map any negative values to 0 and any values greater than 255 to 255, which is usually what you want when working with color pixels.
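
The packing steps above could be wrapped up roughly like this; the function name is an illustrative choice, and memcpy again stands in for the pointer cast:

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: packs 4 signed 32-bit lanes into 4 unsigned bytes,
   clamping each value to [0,255]. */
static void pack_s32_to_u8_saturate(__m128i v, unsigned char dst[4])
{
    v = _mm_packs_epi32(v, v);              /* 32-bit -> 16-bit, signed saturation */
    v = _mm_packus_epi16(v, v);             /* 16-bit -> 8-bit, unsigned saturation */
    int32_t bits = _mm_cvtsi128_si32(v);    /* the 4 result bytes sit in the low 32 bits */
    memcpy(dst, &bits, sizeof bits);
}
```

For example, the lanes (-5, 300, 128, 70000) come out as the bytes (0, 255, 128, 255).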

If you actually need truncation instead of saturation when packing the 32-bit values back into unsigned chars, then you will need to do this yourself, since SSE2 only provides saturating packing instructions. But this can be achieved by masking off everything but the lowest byte of each value:

xmm0 = _mm_and_si128(xmm0, _mm_set1_epi32(0xFF));

right before the above packing procedure. This should amount to just 2 additional SSE instructions, or only 1 additional instruction when amortized over many pixels.
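
A sketch of the truncating variant, with the mask applied before the same two packing steps (the function name is again just an assumption for this example):

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: packs 4 x 32-bit lanes into 4 bytes, keeping only
   the lowest 8 bits of each lane (wrap-around instead of clamping). */
static void pack_s32_to_u8_truncate(__m128i v, unsigned char dst[4])
{
    v = _mm_and_si128(v, _mm_set1_epi32(0xFF));  /* every lane is now in [0,255] */
    v = _mm_packs_epi32(v, v);                   /* saturation can no longer trigger */
    v = _mm_packus_epi16(v, v);
    int32_t bits = _mm_cvtsi128_si32(v);
    memcpy(dst, &bits, sizeof bits);
}
```

Here the lanes (-5, 300, 128, 255) come out as the bytes (251, 44, 128, 255), since -5 is 0xFFFFFFFB and 300 is 0x12C.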