可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm building a PowerPC interpreter, and it works quite well. In the Power architecture the condition register CR0 (EFLAGS on x86) is updated on almost any instruction. It is set like this. The value of CR0 is 1, if the last result was negative, 2 if the last result was positive, 4 otherwise.
My first naive method to interpret this is:
if (n < 0)
cr0 = 1
else if (n > 0)
cr0 = 2;
else
cr0 = 4;
However I understand that all those branches won't be optimal, being run millions of times per second. I've seen some bit hacking on SO, but none seemed adeguate. For example I found many examples to convert a number to -1, 0, or 1 accordingly to the sign or 0. But how to make -1 = 1, 1 = 2, 0 = 4?
I'm asking for the help of the Bit Hackers...
Thanks in advance
Update:
First of all: thanks guys, you've been great. I'll test all of your codes carefully for speed and you'll be the first to know who's the winner.
@jalf: About your first advice, I wasn't actually calculating CR0 on every instruction. I was rather keeping a lastResult variable, and when (and if) a following instruction asked for a flag, do the comparison. Three main motivations took me back to "everytime" update:
- On PPC you're not forced to update CR0 like on x86 (where ADD always change EFLAGS, even if not needed), you have two flavours of ADD, one updating. If the compiler chooses to use the updating one, it means that it's going to use the CR0 at some point, so there no point at delaying...
- There's a particularly painful instruction called mtcrf, that enables you to change the CR0 arbitrarly. You can even set it to 7, with no arithmetic meaning... This just destroys the possibility of keeping a "lastResult" variable.
回答1:
First, if this variable is to be updated after (nearly) every instruction, the obvious piece of advice is this:
don't
Only update it when the subsequent instructions need its value. At any other time, there's no point in updating it.
But anyway, when we update it, what we want is this behavior:
R < 0 => CR0 == 0b001
R > 0 => CR0 == 0b010
R == 0 => CR0 == 0b100
Ideally, we won't need to branch at all. Here's one possible approach:
- Set CR0 to the value
1
. (if you really want speed, investigate whether this can be done without fetching the constant from memory. Even if you have to spend a couple of instructions on it, it may well be worth it)
- If R >= 0, left shift by one bit.
- If R == 0, left shift by one bit
Where steps 2 and 3 can be transformed to eliminate the "if" part
CR0 <<= (R >= 0);
CR0 <<= (R == 0);
Is this faster? I don't know. As always, when you are concerned about performance, you need to measure, measure, measure.
However, I can see a couple of advantages of this approach:
- we avoid branches completely
- we avoid memory loads/stores.
- the instructions we rely on (bit shifting and comparison) should have low latency, which isn't always the case for multiplication, for example.
The downside is that we have a dependency chain between all three lines: Each modifies CR0, which is then used in the next line. This limits instruction-level parallelism somewhat.
To minimize this dependency chain, we could do something like this instead:
CR0 <<= ((R >= 0) + (R == 0));
so we only have to modify CR0 once, after its initialization.
Or, doing everything in a single line:
CR0 = 1 << ((R >= 0) + (R == 0));
Of course, there are a lot of possible variations of this theme, so go ahead and experiment.
回答2:
Lots of answers that are approximately "don't" already, as usual :) You want the bit hack? You will get it. Then feel free to use it or not as you see fit.
You could use that mapping to -1, 0 and 1 (sign
), and then do this:
return 7 & (0x241 >> ((sign(x) + 1) * 4));
Which is essentially using a tiny lookup table.
Or the "naive bithack":
int y = ((x >> 31) & 1) | ((-x >> 31) & 2)
return (~(-y >> 31) & 4) | y;
The first line maps x < 0
to 1, x > 0
to 2 and x == 0
to 0. The second line then maps y == 0
to 4 and y != 0
to y.
And of course it has a sneaky edge case for x = 0x80000000 which is mapped to 3. Oops. Well let's fix that:
int y = ((x >> 31) & 1) | ((-x >> 31) & 2)
y &= 1 | ~(y << 1); // remove the 2 if odd
return (~(-y >> 31) & 4) | y;
回答3:
The following expression is a little cryptic, but not excessively so, and it looks to be something the compiler can optimize pretty easily:
cr0 = 4 >> ((2 * (n < 0)) + (n > 0));
Here's what GCC 4.6.1 for an x86 target compiles it to with -O2
:
xor ecx, ecx
mov eax, edx
sar eax, 31
and eax, 2
test edx, edx
setg cl
add ecx, eax
mov eax, 4
sar eax, cl
And VC 2010 with /Ox
looks pretty similar:
xor ecx, ecx
test eax, eax
sets cl
xor edx, edx
test eax, eax
setg dl
mov eax, 4
lea ecx, DWORD PTR [edx+ecx*2]
sar eax, cl
The version using if
tests compiles to assembly that uses jumps with either of these compilers. Of course, you'll never really be sure what any particular compiler is going to do with whatever particular bit of code you choose unless you actually examine the output. My expression is cryptic enough that unless it was really a performance critical bit of code, I might still go with with if
statement version. Since you need to set the CR0 register frequently, I think it might be worth measuring if this expression helps at all.
回答4:
gcc with no optimization
movl %eax, 24(%esp) ; eax has result of reading n
cmpl $0, 24(%esp)
jns .L2
movl $1, 28(%esp)
jmp .L3
.L2:
cmpl $0, 24(%esp)
jle .L4
movl $2, 28(%esp)
jmp .L3
.L4:
movl $4, 28(%esp)
.L3:
With -O2:
movl $1, %edx ; edx = 1
cmpl $0, %eax
jl .L2 ; n < 0
cmpl $1, %eax ; n < 1
sbbl %edx, %edx ; edx = 0 or -1
andl $2, %edx ; now 0 or 2
addl $2, %edx ; now 2 or 4
.L2:
movl %edx, 4(%esp)
I don't think you are likely to do much better
回答5:
I was working on this one when my computer crashed.
int cr0 = (-(n | n-1) >> 31) & 6;
cr0 |= (n >> 31) & 5;
cr0 ^= 4;
Here's the resulting assembly (for Intel x86):
PUBLIC ?tricky@@YAHH@Z ; tricky
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_n$ = 8 ; size = 4
?tricky@@YAHH@Z PROC ; tricky
; Line 18
mov ecx, DWORD PTR _n$[esp-4]
lea eax, DWORD PTR [ecx-1]
or eax, ecx
neg eax
sar eax, 31 ; 0000001fH
; Line 19
sar ecx, 31 ; 0000001fH
and eax, 6
and ecx, 5
or eax, ecx
; Line 20
xor eax, 4
; Line 22
ret 0
?tricky@@YAHH@Z ENDP ; tricky
And a complete exhaustive test which is also reasonably suitable for benchmarking:
#include <limits.h>
int direct(int n)
{
int cr0;
if (n < 0)
cr0 = 1;
else if (n > 0)
cr0 = 2;
else
cr0 = 4;
return cr0;
}
const int shift_count = sizeof(int) * CHAR_BIT - 1;
int tricky(int n)
{
int cr0 = (-(n | n-1) >> shift_count) & 6;
cr0 |= (n >> shift_count) & 5;
cr0 ^= 4;
return cr0;
}
#include <iostream>
#include <iomanip>
int main(void)
{
int i = 0;
do {
if (direct(i) != tricky(i)) {
std::cerr << std::hex << i << std::endl;
return i;
}
} while (++i);
return 0;
}
回答6:
If there is a faster method, the compiler probably already is using it.
Keep your code short and simple; that makes the optimizer most effective.
The simple straightforward solution does surprisingly well speed-wise:
cr0 = n? (n < 0)? 1: 2: 4;
x86 Assembly (produced by VC++ 2010, flags /Ox
):
PUBLIC ?tricky@@YAHH@Z ; tricky
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_n$ = 8 ; size = 4
?tricky@@YAHH@Z PROC ; tricky
; Line 26
mov eax, DWORD PTR _n$[esp-4]
test eax, eax
je SHORT $LN3@tricky
xor ecx, ecx
test eax, eax
setns cl
lea eax, DWORD PTR [ecx+1]
; Line 31
ret 0
$LN3@tricky:
; Line 26
mov eax, 4
; Line 31
ret 0
?tricky@@YAHH@Z ENDP ; tricky
回答7:
For a completely unportable approach, I wonder if this might have any speed benefit:
void func(signed n, signed& cr0) {
cr0 = 1 << (!(unsigned(n)>>31)+(n==0));
}
mov ecx,eax ;with MSVC10, all optimizations except inlining on.
shr ecx,1Fh
not ecx
and ecx,1
xor edx,edx
test eax,eax
sete dl
mov eax,1
add ecx,edx
shl eax,cl
mov ecx,dword ptr [cr0]
mov dword ptr [ecx],eax
compared to your code on my machine:
test eax,eax ; if (n < 0)
jns func+0Bh (401B1Bh)
mov dword ptr [ecx],1 ; cr0 = 1;
ret ; cr0 = 2; else cr0 = 4; }
xor edx,edx ; else if (n > 0)
test eax,eax
setle dl
lea edx,[edx+edx+2]
mov dword ptr [ecx],edx ; cr0 = 2; else cr0 = 4; }
ret
I don't know much at all about assembly, so I can't say for sure if this would have any benefit (or even if mine has any jumps. I see no instructions beginning with j anyway). As always, (and as everyone else said a million times) PROFILE.
I doubt this is faster than say Jalf or Ben's, but I didn't see any that took advantage of the fact that on x86 all negative numbers have a certain bit set, and I figured I'd throw one out.
[EDIT]BenVoigt suggests cr0 = 4 >> ((n != 0) + (unsigned(n) >> 31));
to remove the logical negation, and my tests show that is a vast improvement.
回答8:
The following is my attempt.
int cro = 4 >> (((n > 0) - (n < 0)) % 3 + (n < 0)*3);