I want to implement unsigneda integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)
0. In C, the strawman implementation, which doesn't take advantage of the restricted domain of q
could something like the following function1:
/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
uint64_t res = p / q;
return p % q == 0 ? res : res + 1;
}
... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q
is a power of 2.
You can assume we have an efficient lg(unsigned int x)
function, which returns the base-2 log of x
, if x
is a power-of-two.
Undefined behavior is fine if q
is zero.
Please note that the simple solution: (p+q-1) >> lg(q)
doesn't work in general - try it with p == 2^64-100 and q == 256
2 for example.
Platform Details
I'm interested in solutions in C, that are likely to perform well across a variety of platforms, but for the sake of concreteness, awarding the bounty and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:
- Skylake CPU
gcc 5.4.0
with compile flags-O3 -march=haswell
Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg()
function I said was available as follows:
inline uint64_t lg(uint64_t x) {
return 63U - (uint64_t)__builtin_clzl(x);
}
inline uint32_t lg32(uint32_t x) {
return 31U - (uint32_t)__builtin_clz(x);
}
I verified that these compile down to a single bsr
instruction, at least with -march=haswell
, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.
Benchmark
I wrote a benchmark for the existing answers, and will share and update the results as changes are made.
Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop3.
You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried to that with the bench
binary, but really the results are fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0
takes the same time. So you are mostly just measuring the call + ret
overhead. Furthermore, you are almost never really going to use the functions like this - unless you messed up, they'll be available for inlining and that changes everything.
So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optmization, loop hoisting and even vectorization.
There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide
is dependent on the previous call, so in general calls cannot be easily overlapped4:
uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
uint32_t total = p;
for (unsigned i=0; i < ITERS; i++) {
total += divide_algo(total, q);
q = rotl1(q);
}
return total;
}
Note that the running total
depends so on the output of each divide call, and that it is also an input to the divide
call.
The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:
uint32_t bench_divide_throughput(uint32_t p, uint32_t q) {
uint32_t total = p;
for (unsigned i=0; i < ITERS; i++) {
total += fname(i, q);
q = rotl1(q);
}
return total;
}
Note that here we feed in the loop counter as the the dividend - this is variable, but it doesn't depend on the previous divide call.
Furthermore, each benchmark has three flavors of behavior for the divisor, q
:
- Compile-time constant divisor. For example, a call to
divide(p, 8)
. This is common in practice, and the code can be much simpler when the divisor is known at compile time. - Invariant divisor. Here the divisor is not know at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant does.
- Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.
Combining everything you get a total of 6 distinct benchmarks.
Results
Overall
For the purposes of picking an overall best algorithm, I looked at each of 12 subsets for the proposed algorithms: (latency, throughput) x (constant a, invariant q, variable q) x (32-bit, 64-bit)
and assigned a score of 2, 1, or 0 per subtest as follows:
- The best algorithm(s) (within 5% tolerance) receive a score of 2.
- The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
- The remaining algorithms score zero.
Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:
╔═══════════════════════╦═══════╗
║ Algorithm ║ Score ║
╠═══════════════════════╬═══════╣
║ divide_user23_variant ║ 20 ║
║ divide_chux ║ 20 ║
║ divide_user23 ║ 15 ║
║ divide_peter ║ 14 ║
║ divide_chrisdodd ║ 12 ║
║ stoke32 ║ 11 ║
║ divide_chris ║ 0 ║
║ divide_weather ║ 0 ║
╚═══════════════════════╩═══════╝
So the for the purposes of this specific test code, with this specific compiler and on this platform, user2357112 "variant" (with ... + (p & mask) != 0
) performs best, tied with chux's suggestion (which is in fact identical code).
Here are all the sub-scores which sum to the above:
╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
║ ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
║ divide_peter ║ 6 ║ 1 ║ 1 ║ 1 ║ 1 ║ 1 ║ 1 ║
║ stoke32 ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chux ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_user23 ║ 8 ║ 1 ║ 1 ║ 2 ║ 2 ║ 1 ║ 1 ║
║ divide_user23_variant ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_chrisdodd ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chris ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ divide_weather ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ ║ ║ ║ ║ ║ ║ ║ ║
║ 64-bit Algorithm ║ ║ ║ ║ ║ ║ ║ ║
║ divide_peter_64 ║ 8 ║ 1 ║ 1 ║ 1 ║ 2 ║ 2 ║ 1 ║
║ div_stoke_64 ║ 5 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 1 ║
║ divide_chux_64 ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_user23_64 ║ 7 ║ 1 ║ 1 ║ 2 ║ 1 ║ 1 ║ 1 ║
║ divide_user23_variant_64 ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_chrisdodd_64 ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chris_64 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ divide_weather_64 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝
Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".
Analysis
At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is larger: all of the fast algorithms were the fastest on at least two subtests and in general when they didn't finish first they fell into the "close enough" category (they only exceptions being failed vectorizations in the case of stoke
and chrisdodd
). The slow algorithms however scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.
Auto-vectorization
Among the fast algorithms, a large differentiator was the ability to auto-vectorize.
None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.
For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q case. In both of these tests tests the divisor q
is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predicable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]
).
In this scenario, gcc
was able to vectorize peter
, stoke
, chux
, user23
and user23_variant
. It wasn't able to vectorize chrisdodd
for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.
Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec?
column showing whether a function vectorized or not:
Size Vec? Name
045 N bench_c_div_stoke_64
049 N bench_c_divide_chrisdodd_64
059 N bench_c_stoke32_64
212 Y bench_c_divide_chux_64
227 Y bench_c_divide_peter_64
220 Y bench_c_divide_user23_64
212 Y bench_c_divide_user23_variant_64
The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if were variable the generated code would be even larger.
Furthermore, despite the large improvement in runtime, gcc
's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:
on entry: ymm4 == all zeros
on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
4007a4: c5 ed 76 c4 vpcmpeqd ymm0,ymm2,ymm4
4007ad: c5 fd df c5 vpandn ymm0,ymm0,ymm5
4007b1: c5 dd fa c0 vpsubd ymm0,ymm4,ymm0
4007b5: c5 f5 db c0 vpand ymm0,ymm1,ymm0
This chunk works independently on 8 DWORD
elements originating in ymm2
. If we take x
to be a single DWORD
element of ymm2
, and y
the incoming value of ymm1
these foud instructions correspond to:
x == 0 x != 0
x = x ? 0 : -1; // -1 0
x = x & 1; // 1 0
x = 0 - x; // -1 0
x = y1 & x; // y1 0
So the first three instructions could simple be replaced by the first one, as the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop carried) and two extra uops. Evidently gcc
's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.
Branches vs Branch-free
Nearly all of the solutions compiled to branch-free code, even if C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use conditional move or some variant. One exception is chrisdodd
which compiled with a branch (checking if p == 0
) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant q
throughput test:
0000000000400e60 <bench_c_divide_chrisdodd_32>:
400e60: 89 f8 mov eax,edi
400e62: ba 01 00 00 00 mov edx,0x1
400e67: eb 0a jmp 400e73 <bench_c_divide_chrisdodd_32+0x13>
400e69: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
400e70: 83 c2 01 add edx,0x1
400e73: 83 fa 01 cmp edx,0x1
400e76: 74 f8 je 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e78: 8d 4a fe lea ecx,[rdx-0x2]
400e7b: c1 e9 03 shr ecx,0x3
400e7e: 8d 44 08 01 lea eax,[rax+rcx*1+0x1]
400e82: 81 fa 00 ca 9a 3b cmp edx,0x3b9aca00
400e88: 75 e6 jne 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e8a: c3 ret
400e8b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
The branch at 400e76
skips the case that p == 0
. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0
. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd
wins the throughput, variable q case.
Detailed Test Results
Here you can find some detailed test results and some details on the tests themselves.
Latency
The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01
is the same as 4.00
, but the larger deviations like 5.11
seem to be real and reproducible.
The results for divide_plusq_32
use (p + q - 1) >> lg(q)
but are only shown for reference, since this function fails for large p + q
. The results for dummy
are a very simple function: return p + q
, and lets you estimate the benchmark overhead5 (the addition itself should take a cycle at most).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.64
stoke32_32 1.93 5.00
stoke32_64 1.97 5.09
stoke_mul_32 2.75 7.13
stoke_mul_64 2.34 6.06
div_stoke_32 1.94 5.03
div_stoke_64 1.94 5.03
divide_chux_32 1.55 4.01
divide_chux_64 1.55 4.01
divide_user23_32 1.97 5.11
divide_user23_64 1.93 5.00
divide_user23_variant_32 1.55 4.01
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 5.00
divide_chris_32 4.63 11.99
divide_chris_64 4.52 11.72
divide_weather_32 2.72 7.04
divide_weather_64 2.78 7.20
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 1.16 3.00
divide_dummy_64 1.16 3.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.65
stoke32_32 1.93 5.00
stoke32_64 1.93 5.00
stoke_mul_32 2.73 7.08
stoke_mul_64 2.34 6.06
div_stoke_32 1.93 5.00
div_stoke_64 1.93 5.00
divide_chux_32 1.55 4.02
divide_chux_64 1.55 4.02
divide_user23_32 1.95 5.05
divide_user23_64 2.00 5.17
divide_user23_variant_32 1.55 4.02
divide_user23_variant_64 1.55 4.02
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 4.99
divide_chris_32 4.60 11.91
divide_chris_64 4.58 11.85
divide_weather_32 12.54 32.49
divide_weather_64 17.51 45.35
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 0.39 1.00
divide_dummy_64 0.39 1.00
==============================
Bench: Variable Q
==============================
Function ns/call cycles
divide_peter_32 2.31 5.98
divide_peter_64 2.26 5.86
stoke32_32 2.06 5.33
stoke32_64 1.99 5.16
stoke_mul_32 2.73 7.06
stoke_mul_64 2.32 6.00
div_stoke_32 2.00 5.19
div_stoke_64 2.00 5.19
divide_chux_32 2.04 5.28
divide_chux_64 2.05 5.30
divide_user23_32 2.05 5.30
divide_user23_64 2.06 5.33
divide_user23_variant_32 2.04 5.29
divide_user23_variant_64 2.05 5.30
divide_chrisdodd_32 2.04 5.30
divide_chrisdodd_64 2.05 5.31
divide_chris_32 4.65 12.04
divide_chris_64 4.64 12.01
divide_weather_32 12.46 32.28
divide_weather_64 19.46 50.40
divide_plusq_32 1.93 5.00
divide_plusq_64 1.99 5.16
divide_dummy_32 0.40 1.05
divide_dummy_64 0.40 1.04
Throughput
Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively very good for those: a fraction of a cycle in many cases. One result is that unlike most latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger that I would have expected).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
stoke32_32 0.39 1.00
divide_chux_32 0.15 0.39
divide_chux_64 0.53 1.37
divide_user23_32 0.14 0.36
divide_user23_64 0.53 1.37
divide_user23_variant_32 0.15 0.39
divide_user23_variant_64 0.53 1.37
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.34 11.23
divide_chris_64 4.34 11.24
divide_weather_32 1.35 3.50
divide_weather_64 1.35 3.50
divide_plusq_32 0.10 0.26
divide_plusq_64 0.39 1.00
divide_dummy_32 0.08 0.20
divide_dummy_64 0.39 1.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
stoke32_32 0.48 1.25
divide_chux_32 0.15 0.39
divide_chux_64 0.48 1.25
divide_user23_32 0.17 0.43
divide_user23_64 0.58 1.50
divide_user23_variant_32 0.15 0.38
divide_user23_variant_64 0.48 1.25
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.35 11.26
divide_chris_64 4.36 11.28
divide_weather_32 5.79 14.99
divide_weather_64 17.00 44.02
divide_plusq_32 0.12 0.31
divide_plusq_64 0.48 1.25
divide_dummy_32 0.09 0.23
divide_dummy_64 0.09 0.23
==============================
Bench: Variable Q
==============================
Function ns/call cycles
stoke32_32 1.16 3.00
divide_chux_32 1.36 3.51
divide_chux_64 1.35 3.50
divide_user23_32 1.54 4.00
divide_user23_64 1.54 4.00
divide_user23_variant_32 1.36 3.51
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.02 10.41
divide_chris_64 3.84 9.95
divide_weather_32 5.40 13.98
divide_weather_64 19.04 49.30
divide_plusq_32 1.03 2.66
divide_plusq_64 1.03 2.68
divide_dummy_32 0.63 1.63
divide_dummy_64 0.66 1.71
a At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.
0 Of course, this notation doesn't actually work in C where /
truncates the result so the ceiling
does nothing. So consider that pseudo-notation rather than straight C.
1 I'm also interested solutions where all types are uint32_t
rather than uint64_t
.
2 In general, any p
and q
where p + q >= 2^64
causes an issue, due to overflow.
3 That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.
4 This is a bit of a simplification - only the dividend p
is dependent on the output of the previous iteration, so some work related to processing of q
can still be overlapped.
5 Use such estimates with caution however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f
takes 5, it's likely not accurate to say the cost of the real work in f
is 5 - 4 == 1
, because of the way execution is overlapped.
[Re-write] given OP's clarification concerning power-of-2.
The round-up or ceiling part is easy when overflow is not a concern. Simple add
q-1
, then shift.Otherwise as the possibility of rounding depends on all the bits of
p
smaller thanq
, detection of those bits is needed first before they are shifted out.This assumes code has an efficient
lg64(uint64_t x)
function, which returns the base-2 log ofx
, ifx
is a power-of-two.`This seems efficient and works for signed if your compiler is using arithmetic right shifts (usually true).
Use
>> 2
to divide by 4,>> 3
to divide by 8, &ct. Efficientlg
does the work there.You can even divide by 1!
>> 0
If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version mentioned in the question. Note how p+q could overflow if they use all 64 bit. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.
If those constraints cannot be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking if any bits in the dividend are within the range of the divisor. Note: p&-p extracts the lowest set bit on 2s complement machines or the BLSI instruction
Which clang compiles to:
That's a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Lets see: The RCR instruction does but seems like it would be worse ... perhaps the SHRD instruction... It would be something like this (unable to test at the moment)
1 more instruction, but should be compatible with older processors (if it works ... I'm always getting a source/destination swapped - feel free to edit)
Addendum:
The fast log functions don't fully optimize to bsr on clang and ICC but you can do this:
You can do it like this, by comparing dividing
n / d
with dividing by(n-1) / d
.Program output:
Update
I posted an early answer to the original question, which works, but did not consider the efficiency of the algorithm, or that the divisor is always a power of 2. Performing two divisions was needlessly expensive.
I am using MSVC 32-bit compiler on a 64-bit system, so there is no chance that I can provide a best solution for the required target. But it is an interesting question so I have dabbled around to find that the best solution will discover the bit of the 2**n divisor. Using the library function
log2
worked but was so slow. Doing my own shift was much better, but still my best C solution isMy inline 32-bit assembler solution is faster, but of course that will not answer the question. I steal some cycles by assuming that
eax
is returned as the function value.My old answer didn't work if
p
was one more than a power of two (whoops). So my new solution, using the__builtin_ctzll()
and__builtin_ffsll()
functions0 available ingcc
(which as a bonus, provides the fast logarithm you mentioned!):Note that this is assuming that a
long long
is 64 bits. It has to be tweaked a little otherwise.The idea here is that if we need an overflow if and only if
p
has fewer trailing zeroes thanq
. Note that for a power of two, the number of trailing zeroes is equal to the logarithm, so we can use this builtin for the log as well.The
&&lp
part is just for the corner case wherep
is zero: otherwise it will output1
here.0 Can't use
__builtin_ctzll()
for both because it is undefined ifp==0
.This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).
I did manage to get pretty good asm output from pure C for
ceil_div_andmask
, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.It looks like Chris Dodd's general idea of
return ((p-1) >> lg(q)) + 1
with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's(p >> lg(q)) + (bool)(p & (q-1))
also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.If the same
e = lg(q)
is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.Branches are cheap and very efficient if they predict very well, so branching on
d==0
will be best if it's almost never taken. Ifd==0
is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on
p==0
, but probably less likely for branching onp & (q-1)
.It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt).
I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).
Branch on carry from calculating
d-1
to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars:
foo -= bar; if(foo>bar) carry_detected = 1;
)Branchless with ADC / SBB to handle the
d==0
case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to asbb rax, -1
to make CF undo the-= -1
. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.
Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.
CMOV to fix the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.
CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if
d
is ready beforeq
, or at the same time.This doesn't achieve that goal, and needs an extra 7-byte
mov rdx,-1
in the loop.Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.
Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.
Of course, for pre-HSW, you also need to replace SHRX.
We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on
sub rdi, 1
, but otherwise the same instructions as above). See it on Godbolt.When the compiler knows that
p
is non-zero, it takes advantage and makes nice code:Chux / user23_variant
only 3c latency from p->result, and constant q can CSE a lot.
Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the
xor eax,eax
. I haven't looked at how this inlines in a loop.xor eax,eax
should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).