I am looking for a C++ function that returns the inverse sqrt of a float: rsqrt(x) = 1/sqrt(x)
by using the exact method like the built-in XMM operation RSQRTSS
(cf. https://www.felixcloutier.com/x86/rsqrtss). (I.e., I want the built-in approximation rather than the more precise 1/sqrtf
, and I don't care about speed (a lot).)
According to this question:
Is there a fast C or C++ standard library function for double precision inverse square root?
...there is at least no "fast way with double precision" to accomplish this with a standard C++ library. But how to do it slow, non-standard and with float
s?
There is no function in the standard library that does this, but your compiler might optimize the expression
1 / sqrt(value)
such that it does emit the RSQRTSS instruction. For example, with the compiler flags-ffast-math -march=native
, GCC will emit that instruction, see: https://godbolt.org/z/cL6seGThe
RSQRTSS
instruction is readily accessible via the_mm_rsqrt_ss()
intrinsic declared inimmintrin.h
. But we can also emulate the instruction in software, as is done in themy_rsqrtf()
function below. By simply observing the output ofRSQRTSS
one easily finds that its function values are based on a (virtual) table of 211 entries, each 12 bit in size.Note the "virtual" attribute, because it is unlikely that the hardware employs a straight 24 Kbit table. My analysis of the patterns in the table entries doesn't suggest the use of bipartite tables. A much simpler compression scheme -- like the one I used in the code below -- based on a table of base values and a table of offsets may be used. My scheme requires only a narrow adder but reduces ROM storage to 13 Kbit, i.e. almost half.
The implementation below was developed on and tested against an Intel Xeon Processor E3-1270 V2 that uses the Ivy Bridge architecture. There may be some functional differences in the implementation of
RSQRTSS
between various Intel architectures, and such differences are likely between architectures from different x86-84 vendors.The framework below checks that the emulation by
my_rsqrtf()
delivers bit-wise identical results toRSQRTSS
for all four rounding modes, two DAZ (denormals are zero) modes, and two FTZ (flush to zero) modes. We find that the function results are not affected by any modes, which matches with how Intel specifiedRSQRTSS
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual:For what it's worth, I ended up implementing it in plain assembly within C++, as @François Andrieux suggested (more precisely I used ASMJIT).
This works well, though it comes with the drawback of losing portability (less than with plain asm, though). But this is somewhat inherent to my question since I WANT to use a very specific x86 function.
Here's my code: