n is negative, positive or zero? return 1, 2, or 4

2020-06-06 17:24发布

问题:

I'm building a PowerPC interpreter, and it works quite well. In the Power architecture the condition register CR0 (EFLAGS on x86) is updated on almost any instruction. It is set like this. The value of CR0 is 1, if the last result was negative, 2 if the last result was positive, 4 otherwise.

My first naive method to interpret this is:

if (n < 0)
    cr0 = 1
else if (n > 0)
    cr0 = 2;
else
    cr0 = 4;

However I understand that all those branches won't be optimal, being run millions of times per second. I've seen some bit hacking on SO, but none seemed adeguate. For example I found many examples to convert a number to -1, 0, or 1 accordingly to the sign or 0. But how to make -1 = 1, 1 = 2, 0 = 4? I'm asking for the help of the Bit Hackers...

Thanks in advance

Update: First of all: thanks guys, you've been great. I'll test all of your codes carefully for speed and you'll be the first to know who's the winner.

@jalf: About your first advice, I wasn't actually calculating CR0 on every instruction. I was rather keeping a lastResult variable, and when (and if) a following instruction asked for a flag, do the comparison. Three main motivations took me back to "everytime" update:

  1. On PPC you're not forced to update CR0 like on x86 (where ADD always change EFLAGS, even if not needed), you have two flavours of ADD, one updating. If the compiler chooses to use the updating one, it means that it's going to use the CR0 at some point, so there no point at delaying...
  2. There's a particularly painful instruction called mtcrf, that enables you to change the CR0 arbitrarly. You can even set it to 7, with no arithmetic meaning... This just destroys the possibility of keeping a "lastResult" variable.

回答1:

First, if this variable is to be updated after (nearly) every instruction, the obvious piece of advice is this:

don't

Only update it when the subsequent instructions need its value. At any other time, there's no point in updating it.

But anyway, when we update it, what we want is this behavior:

R < 0  => CR0 == 0b001 
R > 0  => CR0 == 0b010
R == 0 => CR0 == 0b100

Ideally, we won't need to branch at all. Here's one possible approach:

  1. Set CR0 to the value 1. (if you really want speed, investigate whether this can be done without fetching the constant from memory. Even if you have to spend a couple of instructions on it, it may well be worth it)
  2. If R >= 0, left shift by one bit.
  3. If R == 0, left shift by one bit

Where steps 2 and 3 can be transformed to eliminate the "if" part

CR0 <<= (R >= 0);
CR0 <<= (R == 0);

Is this faster? I don't know. As always, when you are concerned about performance, you need to measure, measure, measure.

However, I can see a couple of advantages of this approach:

  1. we avoid branches completely
  2. we avoid memory loads/stores.
  3. the instructions we rely on (bit shifting and comparison) should have low latency, which isn't always the case for multiplication, for example.

The downside is that we have a dependency chain between all three lines: Each modifies CR0, which is then used in the next line. This limits instruction-level parallelism somewhat.

To minimize this dependency chain, we could do something like this instead:

CR0 <<= ((R >= 0) + (R == 0));

so we only have to modify CR0 once, after its initialization.

Or, doing everything in a single line:

CR0 = 1 << ((R >= 0) + (R == 0));

Of course, there are a lot of possible variations of this theme, so go ahead and experiment.



回答2:

Lots of answers that are approximately "don't" already, as usual :) You want the bit hack? You will get it. Then feel free to use it or not as you see fit.

You could use that mapping to -1, 0 and 1 (sign), and then do this:

return 7 & (0x241 >> ((sign(x) + 1) * 4));

Which is essentially using a tiny lookup table.

Or the "naive bithack":

int y = ((x >> 31) & 1) | ((-x >> 31) & 2)
return (~(-y >> 31) & 4) | y;

The first line maps x < 0 to 1, x > 0 to 2 and x == 0 to 0. The second line then maps y == 0 to 4 and y != 0 to y.


And of course it has a sneaky edge case for x = 0x80000000 which is mapped to 3. Oops. Well let's fix that:

int y = ((x >> 31) & 1) | ((-x >> 31) & 2)
y &= 1 | ~(y << 1);  // remove the 2 if odd
return (~(-y >> 31) & 4) | y;


回答3:

The following expression is a little cryptic, but not excessively so, and it looks to be something the compiler can optimize pretty easily:

cr0 = 4 >> ((2 * (n < 0)) + (n > 0));

Here's what GCC 4.6.1 for an x86 target compiles it to with -O2:

xor ecx, ecx
mov eax, edx
sar eax, 31
and eax, 2
test    edx, edx
setg    cl
add ecx, eax
mov eax, 4
sar eax, cl

And VC 2010 with /Ox looks pretty similar:

xor ecx, ecx
test eax, eax
sets cl
xor edx, edx
test eax, eax
setg dl
mov eax, 4
lea ecx, DWORD PTR [edx+ecx*2]
sar eax, cl

The version using if tests compiles to assembly that uses jumps with either of these compilers. Of course, you'll never really be sure what any particular compiler is going to do with whatever particular bit of code you choose unless you actually examine the output. My expression is cryptic enough that unless it was really a performance critical bit of code, I might still go with with if statement version. Since you need to set the CR0 register frequently, I think it might be worth measuring if this expression helps at all.



回答4:

gcc with no optimization

        movl    %eax, 24(%esp)  ; eax has result of reading n
        cmpl    $0, 24(%esp)
        jns     .L2
        movl    $1, 28(%esp)
        jmp     .L3
.L2:
        cmpl    $0, 24(%esp)
        jle     .L4
        movl    $2, 28(%esp)
        jmp     .L3
.L4:
        movl    $4, 28(%esp)
.L3:

With -O2:

        movl    $1, %edx       ; edx = 1
        cmpl    $0, %eax
        jl      .L2            ; n < 0
        cmpl    $1, %eax       ; n < 1
        sbbl    %edx, %edx     ; edx = 0 or -1
        andl    $2, %edx       ; now 0 or 2
        addl    $2, %edx       ; now 2 or 4
.L2:
        movl    %edx, 4(%esp)

I don't think you are likely to do much better



回答5:

I was working on this one when my computer crashed.

int cr0 = (-(n | n-1) >> 31) & 6;
cr0 |= (n >> 31) & 5;
cr0 ^= 4;

Here's the resulting assembly (for Intel x86):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 18
        mov     ecx, DWORD PTR _n$[esp-4]
        lea     eax, DWORD PTR [ecx-1]
        or      eax, ecx
        neg     eax
        sar     eax, 31                                 ; 0000001fH
; Line 19
        sar     ecx, 31                                 ; 0000001fH
        and     eax, 6
        and     ecx, 5
        or      eax, ecx
; Line 20
        xor     eax, 4
; Line 22
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky

And a complete exhaustive test which is also reasonably suitable for benchmarking:

#include <limits.h>

int direct(int n)
{
    int cr0;
    if (n < 0)
        cr0 = 1;
    else if (n > 0)
        cr0 = 2;
    else
        cr0 = 4;
    return cr0;
}

const int shift_count = sizeof(int) * CHAR_BIT - 1;
int tricky(int n)
{
    int cr0 = (-(n | n-1) >> shift_count) & 6;
    cr0 |= (n >> shift_count) & 5;
    cr0 ^= 4;
    return cr0;
}

#include <iostream>
#include <iomanip>
int main(void)
{
    int i = 0;
    do {
        if (direct(i) != tricky(i)) {
            std::cerr << std::hex << i << std::endl;
            return i;
        }
    } while (++i);
    return 0;
}


回答6:

If there is a faster method, the compiler probably already is using it.

Keep your code short and simple; that makes the optimizer most effective.

The simple straightforward solution does surprisingly well speed-wise:

cr0 = n? (n < 0)? 1: 2: 4;

x86 Assembly (produced by VC++ 2010, flags /Ox):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 26
        mov     eax, DWORD PTR _n$[esp-4]
        test    eax, eax
        je      SHORT $LN3@tricky
        xor     ecx, ecx
        test    eax, eax
        setns   cl
        lea     eax, DWORD PTR [ecx+1]
; Line 31
        ret     0
$LN3@tricky:
; Line 26
        mov     eax, 4
; Line 31
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky


回答7:

For a completely unportable approach, I wonder if this might have any speed benefit:

void func(signed n, signed& cr0) {
    cr0 = 1 << (!(unsigned(n)>>31)+(n==0));
}

mov         ecx,eax  ;with MSVC10, all optimizations except inlining on.
shr         ecx,1Fh  
not         ecx  
and         ecx,1  
xor         edx,edx  
test        eax,eax  
sete        dl  
mov         eax,1  
add         ecx,edx  
shl         eax,cl  
mov         ecx,dword ptr [cr0]  
mov         dword ptr [ecx],eax  

compared to your code on my machine:

test        eax,eax            ; if (n < 0)
jns         func+0Bh (401B1Bh)  
mov         dword ptr [ecx],1  ; cr0 = 1;
ret                            ; cr0 = 2; else cr0 = 4; }
xor         edx,edx            ; else if (n > 0)
test        eax,eax  
setle       dl  
lea         edx,[edx+edx+2]  
mov         dword ptr [ecx],edx ; cr0 = 2; else cr0 = 4; }
ret  

I don't know much at all about assembly, so I can't say for sure if this would have any benefit (or even if mine has any jumps. I see no instructions beginning with j anyway). As always, (and as everyone else said a million times) PROFILE.

I doubt this is faster than say Jalf or Ben's, but I didn't see any that took advantage of the fact that on x86 all negative numbers have a certain bit set, and I figured I'd throw one out.

[EDIT]BenVoigt suggests cr0 = 4 >> ((n != 0) + (unsigned(n) >> 31)); to remove the logical negation, and my tests show that is a vast improvement.



回答8:

The following is my attempt.

int cro = 4 >> (((n > 0) - (n < 0)) % 3 + (n < 0)*3);