“Custom intrinsic” function for x64 instead of inline asm

Posted 2019-04-28 03:13

Question:

I am currently experimenting with the creation of highly-optimized, reusable functions for a library of mine. For instance, I write the function "is power of 2" the following way:

template<class IntType>  
inline bool is_power_of_two( const IntType x )
{
    return (x != 0) && ((x & (x - 1)) == 0);
}

This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code, with branches:

is_power_of_two PROC
    test    rcx, rcx
    je  SHORT $LN3@is_power_o
    lea rax, QWORD PTR [rcx-1]
    test    rax, rcx
    jne SHORT $LN3@is_power_o
    mov al, 1
    ret 0
$LN3@is_power_o:
    xor al, al
    ret 0
is_power_of_two ENDP

I also found the implementation from here: "The bit twiddler", which would be coded in x64 assembly as follows:

is_power_of_two_fast PROC
    test rcx, rcx
    je  SHORT NotAPowerOfTwo    ; zero is not a power of two
    lea rax, [rcx-1]
    and rax, rcx                ; rax = x & (x - 1)
    neg rax                     ; CF = 1 iff rax was non-zero
    sbb rax, rax                ; rax = -1 if x & (x - 1) != 0, else 0
    inc rax                     ; rax = 1 iff x is a power of two
    ret
NotAPowerOfTwo:
    xor rax, rax
    ret
is_power_of_two_fast ENDP

I tested both subroutines, written separately from the C++ in an assembly module (.asm file), and the second one runs about 20% faster!

Yet the overhead of the function call is considerable: if I compare the second assembly implementation, "is_power_of_two_fast", to the inlined version of the template function, the latter is faster despite its branches!

Unfortunately, the new conventions for x64 specify that no inline assembly is allowed. One should instead use "intrinsic functions".

Now the question: can I implement the faster version "is_power_of_two_fast" as a custom intrinsic function or something similar, so that it can be used inline? Or alternatively, is it possible to somehow force the compiler to produce the low-branch version of the function?

Answer 1:

Even VC 2005 is capable of producing code with the sbb instruction.

For the C code

// note the non-short-circuiting & : both operands are always evaluated, so no branch is needed
bool __declspec(noinline) IsPowOf2(unsigned int a)
{
    return (a >= 1) & ((a & (a - 1)) < 1);
}

compiles to the following

00401000  lea         eax,[ecx-1] 
00401003  and         eax,ecx 
00401005  cmp         eax,1 
00401008  sbb         eax,eax 
0040100A  neg         eax  
0040100C  cmp         ecx,1 
0040100F  sbb         ecx,ecx 
00401011  add         ecx,1 
00401014  and         eax,ecx 
00401016  ret          
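
The listing above is 32-bit output; a 64-bit sketch of the same formulation (an adaptation, not part of the original answer) looks like this, and the x64 compiler may likewise turn it into neg/sbb flag arithmetic instead of branches:

inline bool is_pow_of_2_u64(unsigned long long a)
{
    // same pattern as above, widened to 64 bits; & evaluates both operands, so no early-out branch
    return (a >= 1) & ((a & (a - 1)) < 1);
}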


Answer 2:

No, you cannot implement any custom intrinsics; they are all built into the compiler. It is not only the instructions that are built in: the compiler also knows the semantics of each intrinsic and adapts the generated code to the surrounding code.

One reason inline assembly was removed for x86-64 is that inserting assembly into the middle of a function disturbs the optimizer, and often results in less well-optimized code around the assembler block. There can easily be a net loss there!

The only real use for intrinsics is for "interesting" special instructions that the compiler cannot generate from C or C++ constructs, like BSF or BSR. Almost everything else will work better as an inline function, like your template above.
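
For illustration, here is a minimal sketch of using one of those existing intrinsics in MSVC (assumes <intrin.h> and an x64 target; the helper name highest_set_bit is made up):

#include <intrin.h>   // _BitScanReverse64 maps to the BSR instruction

// returns the index of the highest set bit of x, or -1 if x is zero
inline int highest_set_bit(unsigned __int64 x)
{
    unsigned long index;
    return _BitScanReverse64(&index, x) ? static_cast<int>(index) : -1;
}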

If you need to do something special that the compiler does not understand, the only real option is to write the entire function as a separate assembler module. If the call overhead for that function is too expensive, the optimization probably wasn't worth that much in the first place.
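
The C++ side of such a separate-module setup is just an ordinary external declaration (a sketch; it assumes the is_power_of_two_fast routine from the question is assembled and linked into the project, using the default Microsoft x64 calling convention; the caller check is made up for illustration):

// declaration matching the .asm routine; the compiler cannot inline the call
extern "C" bool is_power_of_two_fast(unsigned long long x);

bool check(unsigned long long v)
{
    return is_power_of_two_fast(v);   // pays the full call/ret overhead on every use
}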

Trust your compiler(tm)!



Answer 3:

VC10 x64 intrinsics would not be of great help in this simple case. The dynamic branching you have is due to the && operator, which is an early-out operator. In many cases (yours is a perfect example) it is better to avoid branching by computing the result for all branches and then applying a mask to select the right one. C++ code with masking could look like this:

#include <type_traits>   // std::is_integral, std::is_unsigned, std::make_signed

template<typename T_Type>
inline bool isPowerOfTwo(T_Type const& x)
{
    // static type checking for the example
    static_assert( std::is_integral<T_Type>::value && std::is_unsigned<T_Type>::value,
                   "limited to unsigned types for the example" );
    typedef typename std::make_signed<T_Type>::type s_Type;

    // number of bits in T_Type, minus one
    const s_Type shift = s_Type(s_Type(sizeof(T_Type) << 3u) - 1);

    // same test as yours, but with no branching: build an all-ones mask when x != 0
    // (move the 0/1 flag into the sign bit, then arithmetic-shift it back down),
    // then AND it with the usual (x & (x - 1)) == 0 check
    return bool( (s_Type(s_Type(x != 0) << shift) >> shift) & ((x & (x - 1)) == 0) );
}

In the above code I am not checking whether the number is negative for signed types. Again, a simple mask will do the trick: an arithmetic shift right by (numBits - 1) yields (~0) for negative numbers and 0 for non-negative ones.
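
A sketch of that signed extension, as I read the description above (not code from the original answer; the name isPowerOfTwoSigned is made up): shift the value right arithmetically by (numBits - 1) to build a sign mask, then use it to force the result to false for negative inputs:

#include <type_traits>

template<typename T>
inline bool isPowerOfTwoSigned(T x)
{
    static_assert( std::is_integral<T>::value && std::is_signed<T>::value,
                   "signed integral types only" );
    typedef typename std::make_unsigned<T>::type u_Type;

    // ~0 for negative x, 0 otherwise (right-shifting a negative value is
    // implementation-defined before C++20, but arithmetic on MSVC/GCC/Clang)
    const T signMask = T(x >> (sizeof(T) * 8 - 1));

    const u_Type u = static_cast<u_Type>(x);
    return bool( T(~signMask) & T((u != 0) & ((u & (u - 1)) == 0)) );
}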



Answer 4:

The only way forward is to step back a bit and look at the larger picture. Either stop implementing micro-optimised APIs, or move on to making larger API calls that are optimised as a whole in MASM64, YASM, NASM, etc.

If you use one of the more powerful assemblers, you can turn the small functions into macros, essentially turning your C/C++ header-based inline-assembler functions into an assembler include file.
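
For example, the body of is_power_of_two_fast from the question could be wrapped in a MASM64 macro and shipped in an include file (a sketch; the file and macro names are made up):

; ispow2.inc -- hypothetical MASM64 include file
; Expands in place at every use site, so there is no call/ret overhead.
; On exit, result = 1 if src is a power of two, 0 otherwise; src is not modified.
IS_POWER_OF_TWO_FAST MACRO result:REQ, src:REQ
    LOCAL NotPow2
    xor     result, result          ; assume "not a power of two"
    test    src, src
    je      NotPow2                 ; zero is not a power of two
    lea     result, [src-1]
    and     result, src             ; result = src & (src - 1)
    neg     result                  ; CF = 1 iff result was non-zero
    sbb     result, result          ; result = -1 if src & (src - 1) != 0, else 0
    inc     result                  ; result = 1 iff src is a power of two
NotPow2:
ENDM

; usage, e.g. with the value in rcx and the result in rax:
;     IS_POWER_OF_TWO_FAST rax, rcx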