Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes.
There is an intrinsic _mm_cvtps_pi8 that will convert 32-bit to 8-bit signed int, but the problem there is that any value over 127 gets clamped to 127. I can't find any instructions that will clamp to unsigned 8-bit values.
I have an intuition that what I may want to do is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8 followed by a move instruction to get the four bytes I care about into memory. Is that the best way to do it? I'm going to see if I can figure out how to encode the shuffle control mask.
UPDATE: The following appears to do exactly what I want. Is there a better way?
#include <tmmintrin.h>
#include <stdio.h>

unsigned char out[8];
unsigned char shuf[8] = { 0, 2, 4, 6, 128, 128, 128, 128 };
float ins[4] = {500, 0, 120, 240};

int main()
{
    __m128 x = _mm_loadu_ps(ins);      // Load the floats (unaligned load: ins is not declared 16-byte aligned)
    __m64 y = _mm_cvtps_pi16(x);       // Convert them to 16-bit ints
    __m64 sh = *(__m64*)shuf;          // Get the shuffle mask into a register
    y = _mm_shuffle_pi8(y, sh);        // Shuffle the lower byte of each into the first four bytes
    *(int*)out = _mm_cvtsi64_si32(y);  // Store the lower 32 bits
    _mm_empty();                       // Clear the MMX state before running other code
    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}
UPDATE2: Here's an even better solution based on Harold's answer:
#include <emmintrin.h>
#include <stdio.h>

unsigned char out[8];
float ins[4] = {10.4, 10.6, 120, 100000};

int main()
{
    __m128 x = _mm_loadu_ps(ins);        // Load the floats (unaligned load: ins is not declared 16-byte aligned)
    __m128i y = _mm_cvtps_epi32(x);      // Convert them to 32-bit ints
    y = _mm_packs_epi32(y, y);           // Pack down to 16 bits with signed saturation, so 100000 becomes 32767
    y = _mm_packus_epi16(y, y);          // Pack down to 8 bits with unsigned saturation, so 32767 becomes 255
    *(int*)out = _mm_cvtsi128_si32(y);   // Store the lower 32 bits
    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}
There is no direct conversion from float to byte, _mm_cvtps_pi8 is a composite. _mm_cvtps_pi16 is also a composite, and in this case it's just doing some pointless stuff that you undo with the shuffle. They also return annoying __m64's.
Anyway, we can convert to dwords (signed, but that doesn't matter), and then pack (unsigned) or shuffle them into bytes. _mm_shuffle_(e)pi8 generates a pshufb, Core2 45nm and AMD processors aren't too fond of it and you have to get a mask from somewhere.
Either way you don't have to round to the nearest integer first, the convert will do that. At least, if you haven't messed with the rounding mode.
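To illustrate the point that the convert already rounds, here is a minimal sketch using only SSE2. The helper name round4_to_i32 is made up for this example, and it assumes the MXCSR rounding mode is still the default (round to nearest, ties to even).

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32 */

/* Hypothetical helper: convert four floats to four 32-bit ints.
   cvtps2dq rounds according to the current MXCSR rounding mode,
   which is round-to-nearest-even unless something changed it. */
static void round4_to_i32(const float *in, int *out)
{
    __m128  x = _mm_loadu_ps(in);          /* unaligned load of 4 floats */
    __m128i y = _mm_cvtps_epi32(x);        /* rounds, does not truncate  */
    _mm_storeu_si128((__m128i *)out, y);   /* unaligned store of 4 ints  */
}
```

With the default rounding mode, 10.4 and 10.6 come out as 10 and 11, and the ties 2.5 and 3.5 go to the even neighbours 2 and 4. The truncating variant is _mm_cvttps_epi32.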
Using packs 1: (not tested) -- probably not useful, packusdw already outputs unsigned words but then packuswb wants signed words again. Kept around because it is referred to elsewhere.
Using different shuffles:
Using shuffle: (not tested)
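A minimal sketch of the shuffle variant with intrinsics, not taken from the original listing. The function name, the USE_SSSE3 macro and the mask layout are assumptions for illustration; the mask picks the low byte of each dword (byte indices 0, 4, 8, 12) and zeroes the rest. The target attribute is a GCC/Clang convenience so the file builds without -mssse3; with other compilers, enable SSSE3 in the build settings.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <string.h>     /* memcpy */

#if defined(__GNUC__) || defined(__clang__)
#define USE_SSSE3 __attribute__((target("ssse3")))
#else
#define USE_SSSE3
#endif

/* Hypothetical helper: four floats already in 0..255 -> four bytes. */
USE_SSSE3
static void floats4_to_u8_shuffle(const float *in, unsigned char *out)
{
    /* A mask byte with the high bit set (-128 == 0x80) writes zero. */
    const __m128i mask = _mm_setr_epi8(0, 4, 8, 12,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128);
    __m128i y = _mm_cvtps_epi32(_mm_loadu_ps(in)); /* float -> int32, rounded */
    y = _mm_shuffle_epi8(y, mask);                 /* low bytes -> bytes 0..3 */
    int packed = _mm_cvtsi128_si32(y);             /* low 32 bits             */
    memcpy(out, &packed, sizeof packed);           /* store the four bytes    */
}
```

Note that the shuffle only truncates: an input outside 0..255 wraps instead of clamping (500 would come out as 244), so this sketch relies on the inputs being clamped beforehand.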
We can solve the unsigned clamping issue by doing the first stage of packing with signed saturation. [0-255] fits in a signed 16-bit int, so values in that range will remain unclamped. Values outside that range will stay on the same side of it. Thus, the signed16 -> unsigned8 step will clamp them correctly.
This only requires SSE2, not SSE4.1 for packusdw.
I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte.
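The signed-then-unsigned packing described above can be sketched for the many-vectors case, where the pack instructions combine four input vectors into one output vector. This is an illustration under assumptions, not code from the answer: the function name floats16_to_u8 is hypothetical, and unaligned loads and stores are used so no alignment is required.

```c
#include <emmintrin.h>  /* SSE2 only */

/* Hypothetical helper: 16 floats -> 16 bytes, clamped to 0..255. */
static void floats16_to_u8(const float *in, unsigned char *out)
{
    /* float -> int32, rounded per the current MXCSR rounding mode */
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(in + 0));
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(in + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(in + 12));

    /* packssdw: int32 -> int16 with signed saturation */
    __m128i ab = _mm_packs_epi32(a, b);
    __m128i cd = _mm_packs_epi32(c, d);

    /* packuswb: int16 -> uint8 with unsigned saturation */
    __m128i r = _mm_packus_epi16(ab, cd);

    _mm_storeu_si128((__m128i *)out, r);
}
```

The output bytes stay in input order, because each pack places the elements of its first operand in the low half of the result. Negative inputs end up as 0 and large inputs such as 100000 end up as 255.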
packusdw is only useful if your final goal is uint16_t, rather than further packing. (Since then you'd need to mask off the sign bit before feeding it to a further pack).
If you did use packusdw -> packuswb, you'd get bogus results when the first step saturated to a uint16_t > 0x7fff. packuswb would interpret that as a negative int16_t and saturate it to 0. packssdw would saturate such inputs to 0x7fff, the max int16_t.
(If your 32-bit inputs are always <= 0x7fff, you can use either, but SSE4.1 packusdw takes more instruction bytes than SSE2 packssdw, and never runs faster.)
If your source values can't be negative, and you only have one vector of 4 floats, not many, you can use harold's pshufb idea. If not, you need to clamp negative values to zero rather than truncate them by shuffling the low bytes into place.
Using pmax and pshufb may be slightly more efficient than using two pack instructions, because pmax can run on port 1 or 5 (Intel Haswell). cvtps2dq is port 1 only, pshufb and pack* are port 5 only.
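A minimal sketch of the clamp-then-shuffle idea with intrinsics. The answer only says pmax; using _mm_max_epi32 (pmaxsd, SSE4.1) on the dwords is my assumption, as are the function name, the USE_SSE41 macro and the mask. The target attribute is a GCC/Clang convenience so the file builds without -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32; also provides _mm_shuffle_epi8 */
#include <string.h>     /* memcpy */

#if defined(__GNUC__) || defined(__clang__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

/* Hypothetical helper: four floats <= 255 (possibly negative) -> four bytes. */
USE_SSE41
static void floats4_to_u8_max_shuffle(const float *in, unsigned char *out)
{
    /* Low byte of each dword to bytes 0..3; -128 (0x80) zeroes a byte. */
    const __m128i mask = _mm_setr_epi8(0, 4, 8, 12,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128);
    __m128i y = _mm_cvtps_epi32(_mm_loadu_ps(in)); /* float -> int32, rounded */
    y = _mm_max_epi32(y, _mm_setzero_si128());     /* clamp negatives to 0    */
    y = _mm_shuffle_epi8(y, mask);                 /* low bytes -> bytes 0..3 */
    int packed = _mm_cvtsi128_si32(y);             /* low 32 bits             */
    memcpy(out, &packed, sizeof packed);           /* store the four bytes    */
}
```

This handles only the lower bound. Values above 255 would still wrap in the shuffle, so the sketch assumes the upper bound is already enforced.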