Efficiently dividing unsigned value by a power of two, rounding up

Published 2020-02-08 11:15

I want to implement unsigned[a] integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)[0]. In C, the strawman implementation, which doesn't take advantage of the restricted domain of q, could look something like the following function[1]:

/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
  uint64_t res = p / q;
  return p % q == 0 ? res : res + 1;
} 

... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q is a power of 2.

You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.

Undefined behavior is fine if q is zero.

Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256[2], for example.
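To make the failure mode concrete, here is a small self-contained check (the function names here are mine, for illustration; lg is computed with a GCC builtin, valid for power-of-two q):

```c
#include <stdint.h>

/* lg(q) for power-of-two q, via a GCC builtin. */
static uint64_t lg_(uint64_t q) { return (uint64_t)__builtin_ctzll(q); }

/* The tempting one-liner: wrong whenever p + q - 1 wraps around 2^64. */
static uint64_t naive_ceil_div(uint64_t p, uint64_t q) {
    return (p + q - 1) >> lg_(q);
}

/* The strawman reference implementation from above. */
static uint64_t ref_ceil_div(uint64_t p, uint64_t q) {
    uint64_t res = p / q;
    return p % q == 0 ? res : res + 1;
}
```

For p == 2^64-100 and q == 256, the sum p + q - 1 wraps to 155, so the naive version returns 0 while the true answer is 2^56.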

Platform Details

I'm interested in solutions in C that are likely to perform well across a variety of platforms, but for the sake of concreteness, for awarding the bounty, and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

  • Skylake CPU
  • gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.

Benchmark

I wrote a benchmark for the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop[3].

You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried that with the bench binary, but really the results are fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 takes the same time. So you are mostly just measuring the call + ret overhead. Furthermore, you are almost never really going to use the functions like this - unless you messed up, they'll be available for inlining, and that changes everything.

So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optimization, loop hoisting and even vectorization.

There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide is dependent on the previous call, so in general calls cannot be easily overlapped[4]:

uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += divide_algo(total, q);                       
      q = rotl1(q);                         
    }
    return total;
  }
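The rotl1 helper isn't shown in the question; a plausible definition (my assumption, not the question's actual code) rotates the divisor left by one bit, so it stays a power of two while changing every iteration:

```c
#include <stdint.h>

/* Rotate left by one bit; keeps a power-of-two divisor a power of two. */
static uint32_t rotl1(uint32_t x) {
    return (x << 1) | (x >> 31);
}
```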

Note that the running total depends on the output of each divide call, and that it is also an input to the next divide call.

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:

uint32_t bench_divide_throughput(uint32_t p, uint32_t q) { 
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += divide_algo(i, q);                     
      q = rotl1(q);                     
    }                                   
    return total;                           
  }

Note that here we feed in the loop counter as the dividend - this is variable, but it doesn't depend on the previous divide call.

Furthermore, each benchmark has three flavors of behavior for the divisor, q:

  1. Compile-time constant divisor. For example, a call to divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
  2. Invariant divisor. Here the divisor is not known at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations available in the compile-time constant case.
  3. Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.

Combining everything you get a total of 6 distinct benchmarks.

Results

Overall

For the purposes of picking an overall best algorithm, I looked at each of the 12 subtests for the proposed algorithms: (latency, throughput) x (constant q, invariant q, variable q) x (32-bit, 64-bit), and assigned a score of 2, 1, or 0 per subtest as follows:

  • The best algorithm(s) (within 5% tolerance) receive a score of 2.
  • The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
  • The remaining algorithms score zero.

Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:

╔═══════════════════════╦═══════╗
║       Algorithm       ║ Score ║
╠═══════════════════════╬═══════╣
║ divide_user23_variant ║    20 ║
║ divide_chux           ║    20 ║
║ divide_user23         ║    15 ║
║ divide_peter          ║    14 ║
║ divide_chrisdodd      ║    12 ║
║ stoke32               ║    11 ║
║ divide_chris          ║     0 ║
║ divide_weather        ║     0 ║
╚═══════════════════════╩═══════╝

So for the purposes of this specific test code, with this specific compiler and on this platform, user2357112's "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).

Here are all the sub-scores which sum to the above:

╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
║                          ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
║ divide_peter             ║     6 ║  1 ║  1 ║  1 ║  1 ║  1 ║  1 ║
║ stoke32                  ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chux              ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23            ║     8 ║  1 ║  1 ║  2 ║  2 ║  1 ║  1 ║
║ divide_user23_variant    ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd         ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris             ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather           ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║                          ║       ║    ║    ║    ║    ║    ║    ║
║ 64-bit Algorithm         ║       ║    ║    ║    ║    ║    ║    ║
║ divide_peter_64          ║     8 ║  1 ║  1 ║  1 ║  2 ║  2 ║  1 ║
║ div_stoke_64             ║     5 ║  1 ║  1 ║  2 ║  0 ║  0 ║  1 ║
║ divide_chux_64           ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23_64         ║     7 ║  1 ║  1 ║  2 ║  1 ║  1 ║  1 ║
║ divide_user23_variant_64 ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd_64      ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris_64          ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather_64        ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝

Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".

Analysis

At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is stark: all of the fast algorithms were the fastest on at least two subtests, and in general when they didn't finish first they fell into the "close enough" category (the only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms, however, scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.

Auto-vectorization

Among the fast algorithms, a large differentiator was the ability to auto-vectorize.

None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.

For the throughput tests, however, many algorithms were able to auto-vectorize for the constant-Q and invariant-Q cases. In both of these tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predictable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).

In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.

Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:

Size Vec? Name
 045    N bench_c_div_stoke_64
 049    N bench_c_divide_chrisdodd_64
 059    N bench_c_stoke32_64
 212    Y bench_c_divide_chux_64
 227    Y bench_c_divide_peter_64
 220    Y bench_c_divide_user23_64
 212    Y bench_c_divide_user23_variant_64

The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because the loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if it were variable, the generated code would be even larger.

Furthermore, despite the large improvement in runtime, gcc's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:

  on entry: ymm4 == all zeros
  on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
  4007a4:       c5 ed 76 c4             vpcmpeqd ymm0,ymm2,ymm4
  4007ad:       c5 fd df c5             vpandn   ymm0,ymm0,ymm5
  4007b1:       c5 dd fa c0             vpsubd   ymm0,ymm4,ymm0
  4007b5:       c5 f5 db c0             vpand    ymm0,ymm1,ymm0

This chunk works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y1 the incoming value of ymm1, these four instructions correspond to:

                    x == 0   x != 0
x  = x ? 0 : -1; //     -1        0
x  = x & 1;      //      1        0
x  = 0 - x;      //     -1        0
x  = y1 & x;     //     y1        0

So the first three instructions could simply be replaced by the first one, since the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop-carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.

Branches vs Branch-free

Nearly all of the solutions compiled to branch-free code, even if the C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use a conditional move or some variant. One exception is chrisdodd, which compiled with a branch (checking if p == 0) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant-q throughput test:

0000000000400e60 <bench_c_divide_chrisdodd_32>:
  400e60:       89 f8                   mov    eax,edi
  400e62:       ba 01 00 00 00          mov    edx,0x1
  400e67:       eb 0a                   jmp    400e73 <bench_c_divide_chrisdodd_32+0x13>
  400e69:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  400e70:       83 c2 01                add    edx,0x1
  400e73:       83 fa 01                cmp    edx,0x1
  400e76:       74 f8                   je     400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e78:       8d 4a fe                lea    ecx,[rdx-0x2]
  400e7b:       c1 e9 03                shr    ecx,0x3
  400e7e:       8d 44 08 01             lea    eax,[rax+rcx*1+0x1]
  400e82:       81 fa 00 ca 9a 3b       cmp    edx,0x3b9aca00
  400e88:       75 e6                   jne    400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e8a:       c3                      ret    
  400e8b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

The branch at 400e76 skips the case that p == 0. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd wins the throughput, variable q case.

Detailed Test Results

Here you can find some detailed test results and some details on the tests themselves.

Latency

The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but the larger deviations like 5.11 seem to be real and reproducible.

The results for divide_plusq_32 use (p + q - 1) >> lg(q), but are only shown for reference, since this function fails for large p + q. The results for dummy are for a very simple function: return p + q; it lets you estimate the benchmark overhead[5] (the addition itself should take a cycle at most).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.64
                stoke32_32            1.93      5.00
                stoke32_64            1.97      5.09
              stoke_mul_32            2.75      7.13
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.94      5.03
              div_stoke_64            1.94      5.03
            divide_chux_32            1.55      4.01
            divide_chux_64            1.55      4.01
          divide_user23_32            1.97      5.11
          divide_user23_64            1.93      5.00
  divide_user23_variant_32            1.55      4.01
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      5.00
           divide_chris_32            4.63     11.99
           divide_chris_64            4.52     11.72
         divide_weather_32            2.72      7.04
         divide_weather_64            2.78      7.20
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            1.16      3.00
           divide_dummy_64            1.16      3.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.65
                stoke32_32            1.93      5.00
                stoke32_64            1.93      5.00
              stoke_mul_32            2.73      7.08
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.93      5.00
              div_stoke_64            1.93      5.00
            divide_chux_32            1.55      4.02
            divide_chux_64            1.55      4.02
          divide_user23_32            1.95      5.05
          divide_user23_64            2.00      5.17
  divide_user23_variant_32            1.55      4.02
  divide_user23_variant_64            1.55      4.02
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      4.99
           divide_chris_32            4.60     11.91
           divide_chris_64            4.58     11.85
         divide_weather_32           12.54     32.49
         divide_weather_64           17.51     45.35
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            0.39      1.00
           divide_dummy_64            0.39      1.00


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.31      5.98
           divide_peter_64            2.26      5.86
                stoke32_32            2.06      5.33
                stoke32_64            1.99      5.16
              stoke_mul_32            2.73      7.06
              stoke_mul_64            2.32      6.00
              div_stoke_32            2.00      5.19
              div_stoke_64            2.00      5.19
            divide_chux_32            2.04      5.28
            divide_chux_64            2.05      5.30
          divide_user23_32            2.05      5.30
          divide_user23_64            2.06      5.33
  divide_user23_variant_32            2.04      5.29
  divide_user23_variant_64            2.05      5.30
       divide_chrisdodd_32            2.04      5.30
       divide_chrisdodd_64            2.05      5.31
           divide_chris_32            4.65     12.04
           divide_chris_64            4.64     12.01
         divide_weather_32           12.46     32.28
         divide_weather_64           19.46     50.40
           divide_plusq_32            1.93      5.00
           divide_plusq_64            1.99      5.16
           divide_dummy_32            0.40      1.05
           divide_dummy_64            0.40      1.04

Throughput

Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively very good for those: a fraction of a cycle in many cases. One result is that unlike most latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger than I would have expected).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.39      1.00
            divide_chux_32            0.15      0.39
            divide_chux_64            0.53      1.37
          divide_user23_32            0.14      0.36
          divide_user23_64            0.53      1.37
  divide_user23_variant_32            0.15      0.39
  divide_user23_variant_64            0.53      1.37
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.34     11.23
           divide_chris_64            4.34     11.24
         divide_weather_32            1.35      3.50
         divide_weather_64            1.35      3.50
           divide_plusq_32            0.10      0.26
           divide_plusq_64            0.39      1.00
           divide_dummy_32            0.08      0.20
           divide_dummy_64            0.39      1.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.48      1.25
            divide_chux_32            0.15      0.39
            divide_chux_64            0.48      1.25
          divide_user23_32            0.17      0.43
          divide_user23_64            0.58      1.50
  divide_user23_variant_32            0.15      0.38
  divide_user23_variant_64            0.48      1.25
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.35     11.26
           divide_chris_64            4.36     11.28
         divide_weather_32            5.79     14.99
         divide_weather_64           17.00     44.02
           divide_plusq_32            0.12      0.31
           divide_plusq_64            0.48      1.25
           divide_dummy_32            0.09      0.23
           divide_dummy_64            0.09      0.23


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
                stoke32_32            1.16      3.00
            divide_chux_32            1.36      3.51
            divide_chux_64            1.35      3.50
          divide_user23_32            1.54      4.00
          divide_user23_64            1.54      4.00
  divide_user23_variant_32            1.36      3.51
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.02     10.41
           divide_chris_64            3.84      9.95
         divide_weather_32            5.40     13.98
         divide_weather_64           19.04     49.30
           divide_plusq_32            1.03      2.66
           divide_plusq_64            1.03      2.68
           divide_dummy_32            0.63      1.63
           divide_dummy_64            0.66      1.71

[a] At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.

[0] Of course, this notation doesn't actually work in C, where / truncates the result, so the ceiling does nothing. So consider that pseudo-notation rather than straight C.

[1] I'm also interested in solutions where all types are uint32_t rather than uint64_t.

[2] In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.

[3] That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.

[4] This is a bit of a simplification - only the dividend p is dependent on the output of the previous iteration, so some work related to processing of q can still be overlapped.

[5] Use such estimates with caution, however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it's likely not accurate to say the cost of the real work in f is 5 - 4 == 1, because of the way execution is overlapped.

9 Answers

Answer by 干净又极端 · 2020-02-08 11:46

There has already been a lot of human brainpower applied to this problem, with several variants of great answers in C along with Peter Cordes's answer which covers the best you could hope for in asm, with notes on trying to map it back to C.

So while the humans are having their kick at the can, I thought I'd see what some brute computing power has to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.

A superoptimizer is usually something like a brute-force search through all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that will quickly spiral out of control for more than a few instructions[1]. STOKE, on the other hand, takes a guided randomized approach: it randomly makes mutations to an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway - there are plenty of papers if that stoked your curiosity.

So within a few minutes STOKE found some pretty interesting solutions. It found almost all the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:

neg rsi                         
dec rdi                         
pext rax, rdi, rsi
inc eax
ret

It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg rsi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift of p - 1 using that mask. Outside of that trick it's using the same trick that user QuestionC used: decrement p, shift, increment p - but it happens to work even for a zero dividend, because it uses 64-bit registers to distinguish the zero case from the MSB-set large-p case.

I added the C version of this algorithm to the benchmark and included it in the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so its vectors can process only half as many elements at once.
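That C version isn't listed here; a sketch of what it plausibly looks like follows (the intrinsic path requires BMI2; the fallback path expresses the same computation as a shift, since for a power-of-two q the -q mask selects exactly the bits at and above lg(q)):

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>

/* ceil(p / q) for power-of-two q via the PEXT trick: -(uint64_t)q is a
   mask with 1s from bit lg(q) upward, so the extract computes
   (p - 1) >> lg(q); doing it in 64 bits makes the p == 0 wraparound
   cancel out after the 32-bit increment. */
static uint32_t divide_pext(uint32_t p, uint32_t q) {
    return (uint32_t)_pext_u64((uint64_t)p - 1, -(uint64_t)q) + 1;
}
#else
/* Portable equivalent of the extract for this specific mask shape. */
static uint32_t divide_pext(uint32_t p, uint32_t q) {
    return (uint32_t)(((uint64_t)p - 1) >> __builtin_ctz(q)) + 1;
}
#endif
```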

Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of it. For example:

tzcntl ebp, esi      
dec esi             
add rdi, rsi        
sarx rax, rdi, rbp
ret

That's the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general, since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64-bit. Convert that back into C with some casts, and the compiler actually figures out that it can use lea to do the addition in one instruction[2]:

stoke_32(unsigned int, unsigned int):
        tzcnt   edx, esi
        mov     edi, edi          ; goes away when inlining
        mov     esi, esi          ; goes away when inlining
        lea     rax, [rsi-1+rdi]
        shrx    rax, rax, rdx
        ret

So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend because that's how the SysV x64 ABI works.
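In plain C, that idea might be written as follows (a sketch with a name of my choosing; the point is that the 64-bit intermediate cannot overflow for 32-bit inputs):

```c
#include <stdint.h>

/* 32-bit ceiling division by a power-of-two q: evaluate (p + q - 1) in
   64 bits, where it cannot wrap, then shift right by lg(q). */
static uint32_t divide_stoke_32(uint32_t p, uint32_t q) {
    return (uint32_t)(((uint64_t)p + q - 1) >> __builtin_ctz(q));
}
```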


For the 64-bit function it didn't come up with anything that blows away the existing solutions but it did come up with some neat stuff, like:

  tzcnt  r13, rsi      
  tzcnt  rcx, rdi      
  shrx   rax, rdi, r13 
  cmp    cl, r13b        
  adc    rax, 0        
  ret

That guy counts the trailing zeros of both arguments, and then adds 1 to the result if q has more trailing zeros than p, since that's when you need to round up. Clever!
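In C the same approach might look like this (a sketch; note that __builtin_ctzll is undefined for 0 in C, unlike the hardware tzcnt which returns 64, so p == 0 is guarded explicitly):

```c
#include <stdint.h>

/* ceil(p / q) for power-of-two q: truncate, then round up exactly when
   q has more trailing zeros than p (i.e. q does not divide p). */
static uint64_t divide_tzcnt(uint64_t p, uint64_t q) {
    if (p == 0)
        return 0;
    unsigned tp = (unsigned)__builtin_ctzll(p);
    unsigned tq = (unsigned)__builtin_ctzll(q);
    return (p >> tq) + (tp < tq);
}
```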

In general, it understood the idea that you needed to shr by the tzcnt really quickly (just like most humans) and then came up with a ton of other solutions to the problem of adjusting the result to account for rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:

tzcnt r13, rsi                  
shrx  rax, rdi, r13             
imul  rsi, rax                   
cmp   rsi, rdi                    
adc   rax, 0                    
ret 

It's basically a "multiply and verify" approach - take the truncated res = p / q, multiply it back, and if it's different from p, add one: return res * q == p ? res : res + 1. Cool. Not really better than Peter's solutions though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9 depending on the architecture. So it sometimes narrows in on solutions that are based on its flawed latency calculation.
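A C rendering of the multiply-and-verify idea (a sketch; res * q here just reconstructs p with its low bits cleared, so the multiply cannot overflow):

```c
#include <stdint.h>

/* Truncated quotient, then +1 if multiplying back doesn't recover p. */
static uint64_t divide_mulverify(uint64_t p, uint64_t q) {
    uint64_t res = p >> __builtin_ctzll(q);
    return res + (res * q != p);
}
```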


[1] Interestingly enough, this brute-force approach is only feasible up to around 5-6 instructions: if you assume you can trim the instruction count to say 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...

[2] This will be a slow lea, however, so the sequence found by STOKE was still optimal for what I optimized for, which was latency.

Answer by 爱情/是我丢掉的垃圾 · 2020-02-08 11:50
uint64_t exponent = lg(q);
uint64_t mask = q - 1;
//     v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent);
//                       ^ round up

The separate computation of the "round up" part avoids the overflow issues of (p+q-1) >> lg(q). Depending on how smart your compiler is, it might be possible to express the "round up" part as ((p & mask) != 0) without branching.
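Spelled out as one function (using __builtin_ctzll for lg, and the != 0 form of the round-up term, which appears to be what the benchmark above calls divide_user23_variant):

```c
#include <stdint.h>

/* ceil(p / q) for power-of-two q, branch-free: the truncated quotient,
   plus one exactly when some low bits of p were shifted out. */
static uint64_t divide_user23_variant(uint64_t p, uint64_t q) {
    uint64_t exponent = (uint64_t)__builtin_ctzll(q);
    uint64_t mask = q - 1;
    return (p >> exponent) + ((p & mask) != 0);
}
```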

Answer by Fickle 薄情 · 2020-02-08 11:50

The efficient way of dividing by a power of 2 for an unsigned integer in C is a right shift - shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).

Now you want to round up rather than down, which you can do by first adding 2^n - 1, or equivalently subtracting one before the shift and adding one after (except for 0). This works out to something like:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return d ? ((d-1) >> e) + 1 : 0;
}

The conditional can be removed by using the boolean value of d for addition and subtraction of one:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return ((d - !!d) >> e) + !!d;
}
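The two forms can be cross-checked against each other, including the d == 0 edge case (both are restated here so the snippet is self-contained):

```c
/* Conditional form: handle d == 0 explicitly. */
static unsigned ceil_div_cond(unsigned d, unsigned e) {
    return d ? ((d - 1) >> e) + 1 : 0;
}

/* Branch-free form: !!d is 1 for nonzero d, 0 for d == 0. */
static unsigned ceil_div_branchless(unsigned d, unsigned e) {
    return ((d - !!d) >> e) + !!d;
}
```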

Due to its size, and the speed requirement, the function should be made static inline. It probably won't make a difference for the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:

static inline unsigned ceil_div(const unsigned d, const unsigned e){...