When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admittedly trivial) code:
int32_t __attribute__((noinline)) negate(int32_t num) {
return -num;
}
int32_t f(int32_t num) {
int32_t x = num < 0 ? negate(num) : num;
return 2*x + 1;
}
Both GCC and clang compile to essentially the following:
.global _f
_f:
cmp edi, 0
jg after_call
call _negate
after_call:
lea rax, [rax*2+1]
ret
This got me thinking: what if x86 had a conditional call instruction like ARM? Imagine if there was such an instruction "ccallcc" with semantics like cmovcc. Then you could do something like:
.global _f
_f:
cmp edi, 0
ccalll _negate
lea rax, [rax*2+1]
ret
Although we can't avoid the branch prediction, we do eliminate a branch. Namely, in the actual GCC/clang output, we are forced to branch regardless of whether num < 0
or not. And if num < 0
we have to branch twice. This seems wasteful.
Now such an instruction doesn't exist in amd64, but I devised a way to simulate such an instruction. I did this by breaking call func
into its component parts: push rip
(well technically [rip+label_after_call_instruction]
) and then jmp func
. We can make the jmp
conditional, but there is no conditional push
. We can simulate this by computing [rip+label_after_call_instruction]
and writing it to the appropriate location on the stack, then conditionally updating rsp
if we plan to call the function (which actually "pushes" the [rip+label_after_call_instruction]
). It looks something like this:
.global _f
_f:
cmp edi, 0
# ccalll _negate
lea rax, [rip+after_ccall] # Compute return address
mov [rsp-8], rax # Prepare to "push" return address
lea rax, [rsp-8] # Compute rsp (after push)
cmovl rsp, rax # Conditionally push (by actually changing rsp)
jl _negate # "Conditional call"
after_ccall:
lea rax, [rax*2+1]
ret
There are a few potential downsides to this approach:
- It introduces several instructions (but they total less cycles than the branch mispredict penalty)
- It requires a write to memory (but the stack probably is cached?)
- It always executes the 2
lea
s andmov
even if the call isn't made (but my understanding is this doesn't matter as cmovcc takes the same number of cycles as mov, for example)
To examine the properties of each of these approaches, I ran the critical sections through iaca
. If you have it installed (and you clone my benchmark gist below), you can run make iaca
to see for yourself. Pass IACAFLAGS='-arch=...'
to specify a different arch.
The output for the branch over approach:
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - ./branch_over_call_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 0.82 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.0 | 0.3 0.0 | 0.3 0.0 | 1.0 | 0.0 | 0.5 | 0.3 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 0.5 | | | | | | 0.5 | | jnle 0x6
| 4^# | | | 0.3 | 0.3 | 1.0 | | | 0.3 | call 0x5
Total Num Of Uops: 5
And the output for the conditional call approach:
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - ./conditional_call_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.94 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 35
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 0.5 0.0 | 0.5 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | 1.0 | | | | | | | lea rax, ptr [rip]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov qword ptr [rsp-0x8], rax
| 1 | | | | | | 1.0 | | | lea rax, ptr [rsp-0x8]
| 1 | 1.0 | | | | | | | | cmovl rsp, rax
| 1 | | | | | | | 1.0 | | jl 0x6
Total Num Of Uops: 6
I looks like the conditional call approach appears to use more of the hardware. But I found it interesting that the conditional approach only has 1 more uop (the branch over approach had 5 uops). I guess this makes sense given that under the hood the call turns into a push and jmp (and the push turns into rsp math and a memory mov). This would suggest to me that the conditional call approach is approximately equivalent (although maybe my simplistic analysis is flawed here?).
At the least, my overarching suspicion that was by introducing several instructions between the cmp
and jl
, I'd make it possible that the result of the cmp
would be available before the jl
can be speculatively executed (thus preventing the branch prediction at all). Although maybe the pipeline is longer than this? This treads into areas with which (despite having read and retained a medium-depth understanding of Agner Fog's optimization manuals) I am not very familiar.
My hypothesis is that for a uniform distribution of (negative and positive) num
s (where branch prediction won't be able to predict the branch around the call
) that my "conditional call" approach will outperform branching around the call.
I wrote a harness to benchmark the performance of these two approaches. You can git clone https://gist.github.com/baileyparker/8a13c22d0e26396921f501fe87f166a9
and make
to run the benchmarks on your machine.
Here's the runtime of 100 iterations of each approach on an array of 1,048,576 numbers (uniformly distributed between int32_t
min and max).
| CPU | Conditional Call | Branch Over |
|-------------------------------------------|-----------------:|------------:|
| Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz | 10.9872 ms | 8.4602 ms |
| Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz | 8.8132 ms | 7.0704 ms |
These results are consistent across runs and although magnified by increasing the array size (or number of iterations), branching over always wins.
I also tried reordering the conditional call steps (computing and conditionaly updating rsp
first, then writing to the stack) but this performed similarly.
What hardware detail that am I missing (or misunderstanding) explains this? From my calculations the extra instructions add somewhere around 6-7 cycles, but a branch mispredict costs 15. So, on average half the numbers are predicted wrong so each iteration costs 15/2 cycles (for the branch over approach) and always 6-7 cycles for the conditional call. The uops from iaca suggest the approaches are even closer in this regard. So, shouldn't the performance be closer? Is my example code too contrived/short? Is my benchmarking technique not appropriate for this kind of low level critical section testing? Is there a way to reorder/change the conditional call to make it more performant (better or comparable to the branch over approach, maybe)?
tl;dr Why does my conditional call code (4th code snippet) perform worse than what gcc/clang produces (conditional jump over the call
) (2nd code snippet) (for the code in the 1st snippet) on my benchmark?