I am currently experimenting with creating highly optimized, reusable functions for a library of mine. For instance, I write the "is power of 2" function the following way:
template<class IntType>
inline bool is_power_of_two( const IntType x )
{
    return (x != 0) && ((x & (x - 1)) == 0);
}
This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code, which contains branches:
is_power_of_two PROC
    test  rcx, rcx
    je    SHORT $LN3@is_power_o
    lea   rax, QWORD PTR [rcx-1]
    test  rax, rcx
    jne   SHORT $LN3@is_power_o
    mov   al, 1
    ret   0
$LN3@is_power_o:
    xor   al, al
    ret   0
is_power_of_two ENDP
I also found an implementation from "The bit twiddler", which would be coded in x64 assembly as follows:
is_power_of_two_fast PROC
    test  rcx, rcx
    je    SHORT NotAPowerOfTwo
    lea   rax, [rcx-1]
    and   rax, rcx           ; rax = x & (x - 1)
    neg   rax                ; sets CF when rax != 0
    sbb   rax, rax           ; rax = -1 if not a power of two, else 0
    inc   rax                ; rax = 0 if not a power of two, else 1
    ret
NotAPowerOfTwo:
    xor   rax, rax
    ret
is_power_of_two_fast ENDP
I tested both subroutines as separate functions in an assembly module (.asm file) called from C++, and the second one runs about 20% faster!
Yet the overhead of the function call is considerable: if I compare the second assembly implementation, "is_power_of_two_fast", to the inlined version of the template function, the latter is faster despite its branches!
Unfortunately, the new conventions for x64 specify that inline assembly is not allowed; one should use "intrinsic functions" instead.
Now the question: can I implement the faster version "is_power_of_two_fast" as a custom intrinsic function or something similar, so that it can be used inline? Or alternatively, is it possible to somehow force the compiler to produce the low-branch version of the function?
Even VC 2005 is capable of producing code with the sbb instruction. The following C code
bool __declspec(noinline) IsPowOf2(unsigned int a)
{
    // plain & (not &&) avoids the short-circuit, so both halves
    // can be evaluated without branching
    return (a>=1)&((a&(a-1))<1);
}
compiles to the following branch-free code:
00401000 lea eax,[ecx-1]
00401003 and eax,ecx
00401005 cmp eax,1
00401008 sbb eax,eax
0040100A neg eax
0040100C cmp ecx,1
0040100F sbb ecx,ecx
00401011 add ecx,1
00401014 and eax,ecx
00401016 ret
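The trick above is the non-short-circuiting & in the source. As a sketch (the name is_power_of_two_branchless is mine, not part of the original library), the same idea can be applied to the original template:

template<class IntType>
inline bool is_power_of_two_branchless( const IntType x )
{
    // bitwise & instead of logical && removes the short-circuit,
    // so the compiler may evaluate both operands unconditionally
    return (x != 0) & ((x & (x - 1)) == 0);
}

Whether the compiler actually emits sbb/neg instead of branches still depends on the optimizer, so check the generated assembly.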
No, you cannot implement any custom intrinsics; they are all built into the compiler. It is not only the instructions that are built in: the compiler also knows the semantics of each intrinsic and adapts the generated code to the surrounding code.
One reason inline assembly was removed for x86-64 is that inserting assembly into the middle of a function disturbs the optimizer, often resulting in less well-optimized code around the assembly block. That can easily be a net loss!
The only real use for intrinsics is for "interesting" special instructions that the compiler cannot generate from C or C++ constructs, like BSF or BSR. Almost everything else will work better as an inline function, like your template above.
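For example, assuming a CPU with POPCNT support, MSVC's __popcnt intrinsic from <intrin.h> gives a branchless one-liner (a sketch, not a drop-in replacement, since POPCNT availability must be checked at runtime via CPUID):

#include <intrin.h>

// A power of two has exactly one bit set. Requires POPCNT
// hardware support; check the CPUID feature bit before use.
inline bool is_power_of_two_popcnt(unsigned int x)
{
    return __popcnt(x) == 1;
}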
If you need to do something special that the compiler does not understand, the only real option is to write the entire function as a separate assembler module. If the call overhead for that function is too expensive, the optimization probably wasn't worth much in the first place.
Trust your compiler(tm)!
VC10 x64 intrinsics would not be of great help in this simple case.
The dynamic branching you see comes from the && operator, which short-circuits (it evaluates its right operand only when needed).
In many cases (your case is a perfect example) it is better to avoid branching by computing the result for all branches and then applying a mask to select the right one. C++ code with masking would look like this:
#include <type_traits>

template<typename T_Type>
inline bool isPowerOfTwo(T_Type const& x)
{
    // static type checking for the example
    static_assert( std::is_integral<T_Type>::value && std::is_unsigned<T_Type>::value,
                   "limited to unsigned types for the example" );
    typedef typename std::make_signed<T_Type>::type s_Type;  // typename is required here
    // same test as yours, but with no branching: the left factor is a
    // mask that is all ones when x != 0 and all zeroes when x == 0
    return bool( ((s_Type( s_Type(x != 0) << (s_Type(sizeof(T_Type)<<3u)-1) )) >> (s_Type(s_Type(sizeof(T_Type)<<3u)-1))) & ((x & (x - 1)) == 0) );
}
In the above code I am not checking whether the number is negative for signed types. Again, a simple mask will do the trick: shift right arithmetically by (numBits - 1) to get a value of ~0 for negative numbers and 0 for non-negative ones.
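A sketch of that mask (the helper name sign_mask is mine; note that right-shifting a negative signed value is implementation-defined before C++20, although MSVC, GCC and Clang all perform an arithmetic shift):

#include <type_traits>

// Smears the sign bit across the whole word: yields ~0 for
// negative inputs and 0 for non-negative ones. Assumes an
// arithmetic right shift on signed types.
template<typename T_Type>
inline T_Type sign_mask(T_Type const& x)
{
    static_assert( std::is_integral<T_Type>::value && std::is_signed<T_Type>::value,
                   "signed integral types only" );
    return x >> (sizeof(T_Type) * 8 - 1);
}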
The only way forward is to step back a bit and look at the larger picture: either stop implementing micro-optimized APIs, or move to larger API calls that are optimized as a whole in MASM64, YASM, NASM, etc.
If you use one of the more powerful assemblers, you can turn the small functions into macros, essentially turning your C/C++ header-based inline assembler functions into an assembler include file.