SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:
_mm_cvtpd_epi64()
_mm_cvtepi64_pd()
It seems that AVX doesn't have them either.
What is the most efficient way to simulate these intrinsics?
This answer is about 64 bit integer to double conversion, without cutting corners. In a previous version of this answer (see paragraph Fast and accurate conversion by splitting ...., below), it was shown that it is quite efficient to split the 64-bit integers in a 32-bit low and a 32-bit high part, convert these parts to double, and compute low + high * 2^32. The instruction counts of these conversions were:

int64_to_double_full_range: 9 instructions (with mul and add as one fma)
uint64_to_double_full_range: 7 instructions (with mul and add as one fma)
Inspired by Mysticial's updated answer, with better optimized accurate conversions, I further optimized the int64_t to double conversion:

int64_to_double_fast_precise: 5 instructions.
uint64_to_double_fast_precise: 5 instructions.

The int64_to_double_fast_precise conversion takes one instruction less than Mysticial's solution. The uint64_to_double_fast_precise code is essentially identical to Mysticial's solution (but with a vpblendd instead of a vpblendw). It is included here because of its similarities with the int64_to_double_fast_precise conversion: the instructions are identical, only the constants differ (see the sketch below).

The conversions may fail if unsafe math optimization options are enabled. With gcc, -O3 is safe, but -Ofast may lead to wrong results, because we may not assume associativity of floating point addition here (the same holds for Mysticial's conversions). With icc, use -fp-model precise.
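A sketch of what these two 5-instruction conversions can look like with AVX2 intrinsics (the function names follow the descriptions above, but the exact constant encoding and layout here are illustrative, not a verbatim listing):

```c
#include <immintrin.h>  /* AVX2 */

/* Full-range int64_t -> double: blend, shift, xor, subtract, add (5 instructions). */
__m256d int64_to_double_fast_precise(const __m256i v)
{
    const __m256i magic_i_lo   = _mm256_set1_epi64x(0x4330000000000000); /* 2^52               as double bits */
    const __m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000080000000); /* 2^84 + 2^63        as double bits */
    const __m256i magic_i_all  = _mm256_set1_epi64x(0x4530000080100000); /* 2^84 + 2^63 + 2^52 as double bits */
    const __m256d magic_d_all  = _mm256_castsi256_pd(magic_i_all);

    __m256i v_lo     = _mm256_blend_epi32(magic_i_lo, v, 0x55);          /* low 32 bits of v under a 2^52 exponent      */
    __m256i v_hi     = _mm256_srli_epi64(v, 32);                         /* high 32 bits of v                           */
            v_hi     = _mm256_xor_si256(v_hi, magic_i_hi32);             /* flip the sign bit, attach the 2^84 exponent */
    __m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all);
    return _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo));           /* must not be re-associated by the compiler   */
}

/* Full-range uint64_t -> double: identical instructions, different constants. */
__m256d uint64_to_double_fast_precise(const __m256i v)
{
    const __m256i magic_i_lo   = _mm256_set1_epi64x(0x4330000000000000); /* 2^52        as double bits */
    const __m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000000000000); /* 2^84        as double bits */
    const __m256i magic_i_all  = _mm256_set1_epi64x(0x4530000000100000); /* 2^84 + 2^52 as double bits */
    const __m256d magic_d_all  = _mm256_castsi256_pd(magic_i_all);

    __m256i v_lo     = _mm256_blend_epi32(magic_i_lo, v, 0x55);          /* low 32 bits of v under a 2^52 exponent */
    __m256i v_hi     = _mm256_srli_epi64(v, 32);                         /* high 32 bits of v                      */
            v_hi     = _mm256_xor_si256(v_hi, magic_i_hi32);             /* attach the 2^84 exponent               */
    __m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all);
    return _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo));           /* (v_hi - magic_d_all) + v_lo            */
}
```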
Fast and accurate conversion by splitting the 64-bit integers in a 32-bit low and a 32-bit high part.
We assume that both the integer input and the double output are in 256 bit wide AVX registers. Two approaches are considered:
int64_to_double_based_on_cvtsi2sd(): as suggested in the comments on the question, use cvtsi2sd 4 times together with some data shuffling. Unfortunately both cvtsi2sd and the data shuffling instructions need execution port 5. This limits the performance of this approach.
int64_to_double_full_range(): we can use Mysticial's fast conversion method twice in order to get an accurate conversion for the full 64-bit integer range. The 64-bit integer is split in a 32-bit low and a 32-bit high part, similarly as in the answers to this question: How to perform uint32/float conversion with SSE?. Each of these pieces is suitable for Mysticial's integer to double conversion. Finally, the high part is multiplied by 2^32 and added to the low part. The signed conversion is a little bit more complicated than the unsigned conversion (uint64_to_double_full_range()), because srai_epi64() doesn't exist.

Code:
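A sketch of the signed full-range conversion along these lines (AVX2 + FMA; the limited-range helper and the function names are illustrative, not a verbatim listing):

```c
#include <immintrin.h>  /* AVX2 + FMA */

/* Helper: int64_t -> double, only valid for inputs in [-2^51, 2^51] (Mysticial's trick). */
static inline __m256d int64_to_double_fast(const __m256i v)
{
    const __m256d magic = _mm256_set1_pd(6755399441055744.0);        /* 2^52 + 2^51 */
    __m256i x = _mm256_add_epi64(v, _mm256_castpd_si256(magic));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), magic);
}

/* Full-range int64_t -> double by splitting into a 32-bit low and a 32-bit high part. */
__m256d int64_to_double_full_range(const __m256i v)
{
    const __m256i msk_lo   = _mm256_set1_epi64x(0xFFFFFFFF);
    const __m256d cnst2_32 = _mm256_set1_pd(4294967296.0);           /* 2^32 */

    __m256i v_lo   = _mm256_and_si256(v, msk_lo);                    /* low 32 bits, zero extended                  */
    __m256i v_hi   = _mm256_srli_epi64(v, 32);                       /* high 32 bits, zero extended                 */
    __m256i v_sign = _mm256_srai_epi32(v, 31);                       /* broadcast the sign bit of each 32-bit half  */
            v_hi   = _mm256_blend_epi32(v_hi, v_sign, 0xAA);         /* sign-extend the high part: emulates the missing srai_epi64 */

    __m256d v_lo_dbl = int64_to_double_fast(v_lo);                   /* in [0, 2^32), within the helper's range     */
    __m256d v_hi_dbl = int64_to_double_fast(v_hi);                   /* in [-2^31, 2^31), within the helper's range */
    return _mm256_fmadd_pd(v_hi_dbl, cnst2_32, v_lo_dbl);            /* low + high * 2^32, mul and add as one fma   */
}
```

The unsigned variant skips the two sign-restoring instructions (the srai and the blend), which accounts for the 9 vs. 7 instruction counts mentioned above.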
The actual performance of these functions may depend on the surrounding code and the CPU generation.

Timing results for 1e9 conversions (256 bit wide) with simple tests B, C, D, and E in the code above, on an Intel Skylake i5 6500 system:
The difference in throughput between int64_to_double_full_range() and int64_to_double_based_on_cvtsi2sd() is larger than I expected.

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned.) See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.
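For example, with AVX512DQ (plus AVX512VL for the 256-bit width) each packed conversion is a single intrinsic (a minimal sketch; the wrapper names are just for illustration):

```c
#include <immintrin.h>

/* Requires AVX512DQ + AVX512VL; compile with e.g. -mavx512dq -mavx512vl. */
__m256i pd_to_epi64(__m256d x) { return _mm256_cvtpd_epi64(x); }   /* vcvtpd2qq */
__m256d epi64_to_pd(__m256i x) { return _mm256_cvtepi64_pd(x); }   /* vcvtqq2pd */
```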
If you only have AVX2 or less, you'll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

If you don't care about infinity or NaN.
For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
For double <-> uint64_t, you only care about values in the range [0, 2^52).
.double -> uint64_t
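A minimal SSE2 sketch of this direction, using the magic-number trick explained below (the constant names are illustrative):

```c
#include <emmintrin.h>  /* SSE2 */

/* double -> uint64_t, only valid for inputs in [0, 2^52).
   Add 2^52 so the integer lands in the mantissa, then strip the
   sign/exponent bits by XORing with the bit pattern of 2^52. */
__m128i double_to_uint64(__m128d x) {
    const __m128d k2_52 = _mm_set1_pd(4503599627370496.0);   /* 2^52 */
    x = _mm_add_pd(x, k2_52);                                /* rounds to an integer */
    return _mm_xor_si128(_mm_castpd_si128(x), _mm_castpd_si128(k2_52));
}
```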
double -> int64_t
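A sketch for the signed direction (same caveats; the magic number 2^52 + 2^51 is an assumed but common choice for the signed variant):

```c
/* double -> int64_t, only valid for inputs in [-2^51, 2^51].
   Add 2^52 + 2^51, then subtract its bit pattern as a 64-bit integer. */
__m128i double_to_int64(__m128d x) {
    const __m128d k3_2_52 = _mm_set1_pd(6755399441055744.0); /* 2^52 + 2^51 */
    x = _mm_add_pd(x, k3_2_52);                              /* rounds to an integer */
    return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(k3_2_52));
}
```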
uint64_t -> double
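The reverse unsigned direction, sketched the same way:

```c
/* uint64_t -> double, only valid for inputs in [0, 2^52).
   OR in the exponent bits of 2^52, then subtract 2^52 in floating point. */
__m128d uint64_to_double(__m128i x) {
    const __m128d k2_52 = _mm_set1_pd(4503599627370496.0);   /* 2^52 */
    x = _mm_or_si128(x, _mm_castpd_si128(k2_52));
    return _mm_sub_pd(_mm_castsi128_pd(x), k2_52);
}
```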
int64_t -> double
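And the reverse signed direction:

```c
/* int64_t -> double, only valid for inputs in [-2^51, 2^51].
   Add the bit pattern of 2^52 + 2^51 as a 64-bit integer, then subtract
   the same magic number in floating point. */
__m128d int64_to_double(__m128i x) {
    const __m128d k3_2_52 = _mm_set1_pd(6755399441055744.0); /* 2^52 + 2^51 */
    x = _mm_add_epi64(x, _mm_castpd_si128(k3_2_52));
    return _mm_sub_pd(_mm_castsi128_pd(x), k3_2_52);
}
```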
Rounding Behavior:
For the double -> uint64_t conversion, rounding works correctly following the current rounding mode (which is usually round-to-even).
For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

How does it work?
Despite this trick being only 2 instructions, it's not entirely self-explanatory.
The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

Now all that's left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.
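A small scalar demonstration of the unsigned trick (a standalone sketch; M is 2^52 as a double):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double M = 4503599627370496.0;             /* 2^52 */
    double x = 3.7;
    double y = x + M;                          /* x is rounded into the mantissa of y */
    uint64_t ybits, Mbits;
    memcpy(&ybits, &y, sizeof ybits);
    memcpy(&Mbits, &M, sizeof Mbits);
    printf("%llu\n", (unsigned long long)(ybits ^ Mbits));   /* prints 4 */
    return 0;
}
```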
Related: A fast method to round a double to a 32-bit int explained
Full Range int64 -> double:
After many years, I finally had a need for this.
uint64_t -> double
int64_t -> double
These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.
These are similar to wim's answer - but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.
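A quick way to sanity-check any of the full-range conversions is to compare the packed result against the compiler's scalar casts (a sketch; the function under test is assumed to have the signature shown):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

__m256d uint64_to_double_fast_precise(const __m256i v);   /* implementation under test */

/* Returns 1 if all four lanes match the scalar uint64_t -> double conversion. */
int check4(const uint64_t in[4]) {
    __m256i v = _mm256_loadu_si256((const __m256i*)in);
    double out[4];
    _mm256_storeu_pd(out, uint64_to_double_fast_precise(v));
    int ok = 1;
    for (int i = 0; i < 4; i++) {
        if (out[i] != (double)in[i]) {
            printf("mismatch: %llu\n", (unsigned long long)in[i]);
            ok = 0;
        }
    }
    return ok;
}
```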