I'm doing some benchmarking and found out that `fabsf()` is often about 10x slower than `fabs()`. So I disassembled it, and it turns out the `double` version uses the `fabs` instruction while the `float` version does not. Can this be improved? The following is faster, but not by much, and I'm afraid it may not work; it's a little too low-level:
```
// MUINT32 is assumed to be a 32-bit unsigned integer type
typedef unsigned int MUINT32;

float mabs(float i)
{
    // clear the sign bit of the float's bit representation
    (*reinterpret_cast<MUINT32*>(&i)) &= 0x7fffffff;
    return i;
}
```
Edit: Sorry, I forgot about the compiler. I'm still using the good old VS2005, no special libs.
Did you try the `std::abs` overload for `float`? That would be the canonical C++ way.

Also, as an aside, I should note that your bit-modifying version violates the strict-aliasing rules (in addition to the more fundamental assumption that `int` and `float` have the same size), and as such is undefined behavior.
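If you want to keep the bit-fiddling approach, a well-defined way to do it is to copy the bits through `memcpy` instead of through a reinterpreted pointer. A minimal sketch (the function name is mine, and it assumes `unsigned int` is 32 bits, which holds on VS2005/x86):

```
#include <cstring>  // std::memcpy

// Clears the sign bit without violating strict aliasing:
// memcpy between objects is the well-defined way to reinterpret bits.
float mabs_safe(float f)
{
    unsigned int bits;                  // assumes a 32-bit unsigned int
    std::memcpy(&bits, &f, sizeof f);   // copy the float's bit pattern out
    bits &= 0x7fffffffu;                // clear the sign bit
    std::memcpy(&f, &bits, sizeof f);   // copy the modified bits back
    return f;
}
```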
You can easily test the different possibilities using the code below. It essentially tests your bit-fiddling against a naive template `abs` and against `std::abs`. Not surprisingly, the naive template `abs` wins. Well, kind of surprisingly it wins; I'd expect `std::abs` to be equally fast. Note that `-O3` actually makes things slower (at least on Coliru).

Coliru's host system shows these timings:
And these timings for a VirtualBox VM running Arch with GCC 4.9 on a Core i7:
And these timings on MSVS2013 (Windows 7 x64):
If I haven't made some blatantly obvious mistake in this benchmark code (don't shoot me over it, I wrote it up in about 2 minutes), I'd say just use `std::abs`, or the template version if that turns out to be slightly faster for you.

The code:
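A minimal sketch of such a benchmark (the iteration count, timing harness, and `volatile` sink are assumptions):

```
#include <cmath>
#include <cstdio>
#include <ctime>

// The bit-fiddling version from the question
float mabs(float i)
{
    (*reinterpret_cast<unsigned int*>(&i)) &= 0x7fffffff;
    return i;
}

// Naive template abs
template <typename T>
T tabs(T t)
{
    return t < 0 ? -t : t;
}

// Non-template wrapper so the overloaded std::abs can be passed around
float stdabs(float f)
{
    return std::abs(f);
}

// Times one implementation; the loop count and sink scheme are assumptions.
void benchmark(float (*f)(float), const char* name)
{
    const int N = 100000000;
    volatile float sink = 0.0f;  // keeps the loop from being optimized away
    std::clock_t start = std::clock();
    for (int i = 0; i < N; ++i)
        sink = sink + f(static_cast<float>(i - N / 2));
    double elapsed = static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("%-20s %f s (sink=%f)\n", name, elapsed, static_cast<float>(sink));
}

int main()
{
    benchmark(mabs,        "bit-fiddling");
    benchmark(tabs<float>, "naive template abs");
    benchmark(stdabs,      "std::abs");
    return 0;
}
```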
Oh, and to answer your actual question: if the compiler can't generate more efficient code, I doubt there is a faster way, save for micro-optimized assembly, especially for elementary operations such as this.
There are many things at play here. First off, the x87 coprocessor is deprecated in favor of SSE/AVX, so I'm surprised to read that your compiler still uses the `fabs` instruction. It's quite possible that the others who posted benchmark answers on this question use a platform that supports SSE, so your results might be wildly different.

I'm not sure why your compiler uses different logic for `fabs` and `fabsf`. It's totally possible to load a `float` onto the x87 stack and use the `fabs` instruction on it just as easily. The problem with reproducing this yourself, without compiler support, is that you can't integrate the operation into the compiler's normal optimizing pipeline: if you say "load this float, use the `fabs` instruction, return this float to memory", then the compiler will do exactly that... and it may involve writing back to memory a float that was already ready to be processed, loading it back in, using the `fabs` instruction, writing it back to memory, and loading it again onto the x87 stack to resume the normal, optimizable pipeline. That would be four wasted load-store operations, because all it needed to do was `fabs`.
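To make that round trip concrete, here is roughly what a hand-rolled version would look like with VS2005's x86 inline assembler (a sketch of the pattern to avoid, not a recommendation; the function name is mine):

```
// MSVC x86 inline assembly; forces the float through memory on both
// sides of the fabs, illustrating the wasted load/store traffic above.
float asm_fabs(float f)
{
    __asm {
        fld  f   // load the float from memory onto the x87 stack
        fabs     // clear the sign bit in st(0)
        fstp f   // store it back to memory, popping the stack
    }
    return f;    // the compiler now has to load it from memory yet again
}
```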
In short, you are unlikely to beat integrated compiler support for floating-point operations. If you don't have this support, inline assembler might just make things even slower than they presumably already are. The fastest thing for you to do might even be to use the `fabs` function instead of the `fabsf` function on your floats.
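That is, something as simple as this (a sketch; whether it actually wins depends on your compiler and on the cost of the float/double conversions):

```
#include <math.h>

// Route the float through the double-precision fabs, which the compiler
// turns into the x87 fabs instruction; the conversions are the only cost.
inline float mabs(float f)
{
    return static_cast<float>(fabs(static_cast<double>(f)));
}
```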
For reference, modern compilers on modern platforms use the SSE instructions `andps` (for floats) and `andpd` (for doubles) to AND out the sign bit, very much like you're doing yourself, but dodging all the language-semantics issues. They're both equally fast. Modern compilers may also detect patterns like `x < 0 ? -x : x` and produce the optimal `andps`/`andpd` instruction without the need for a compiler intrinsic.
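For illustration, here is roughly what that compiles down to, written out with SSE intrinsics (a sketch; compilers emit this on their own, so normally you'd just write the comparison pattern and let the optimizer do it):

```
#include <xmmintrin.h>  // SSE intrinsics

// What an andps-based fabs amounts to: AND the value against a mask
// that has every bit set except the sign bit.
float sse_abs(float f)
{
    __m128 x    = _mm_set_ss(f);        // place the float in an SSE register
    __m128 sign = _mm_set_ss(-0.0f);    // -0.0f has only the sign bit set
    x = _mm_andnot_ps(sign, x);         // x & ~sign: clears the sign bit
    return _mm_cvtss_f32(x);            // extract the scalar result
}

// The portable pattern modern compilers recognize and turn into andps:
inline float pattern_abs(float x)
{
    return x < 0 ? -x : x;
}
```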