Why is pow(int, int) so slow?

I've been working on a few project Euler exercises to improve my knowledge of C++.

I've written the following function:

int a = 0,b = 0,c = 0;

for (a = 1; a <= SUMTOTAL; a++)
{
    for (b = a+1; b <= SUMTOTAL-a; b++)
    {
        c = SUMTOTAL-(a+b);

        if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
        {
            std::cout << "a: " << a << " b: " << b << " c: "<< c << std::endl;
            std::cout << a * b * c << std::endl;
        }
    }
}

This computes in 17 milliseconds.

However, if I change the line

if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)

if (c == sqrt((a*a)+(b*b)) && b < c)

the computation takes place in 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I'm missing which makes the first expression compute so much slower?

回答1:

pow() works with real floating-point numbers and uses under the hood the formula

pow(x,y) = e^(y log(x))

to calculate x^y. The int are converted to double before calling pow. (log is the natural logarithm, e-based)

x^2 using pow() is therefore slower than x*x.

Edit based on relevant comments

Using pow even with integer exponents may yield incorrect results (PaulMcKenzie)
In addition to using a math function with double type, pow is a function call (while x*x isn't) (jtbandes)
Many modern compilers will in fact optimize out pow with constant integer arguments, but this should not be relied upon.

回答2:

You've picked one of the slowest possible ways to check

c*c == a*a + b*b   // assuming c is non-negative

That compiles to three integer multiplications (one of which can be hoisted out of the loop). Even without pow(), you're still converting to double and taking a square root, which is terrible for throughput. (And also latency, but branch prediction + speculative execution on modern CPUs means that latency isn't a factor here).

Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4 times slower than what I got gcc to emit (below).

You can also optimize the loop condition to break out of the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.

void foo_optimized()
{ 
  for (int a = 1; a <= SUMTOTAL; a++) {
    for (int b = a+1; b < SUMTOTAL-a-b; b++) {
        // int c = SUMTOTAL-(a+b);   // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(
        int c = (SUMTOTAL-a) - b;
        // if (b >= c) break;  // just changed the loop condition instead

        // the compiler can hoist a*a out of the loop for us
        if (/* b < c && */ c*c == a*a + b*b) {
            // Just print a newline.  std::endl also flushes, which bloats the asm
            std::cout << "a: " << a << " b: " << b << " c: "<< c << '\n';
            std::cout << a * b * c << '\n';
        }
    }
  }
}

This compiles (with gcc6.2 -O3 -mtune=haswell) to code with this inner loop. See the full code on the Godbolt compiler explorer.

# a*a is hoisted out of the loop.  It's in r15d
.L6:
    add     ebp, 1    # b++
    sub     ebx, 1    # c--
    add     r12d, r14d        # ivtmp.36, ivtmp.43  # not sure what this is or why it's in the loop, would have to look again at the asm outside
    cmp     ebp, ebx  # b, _39
    jg      .L13    ## This is the loop-exit branch, not-taken until the end
                    ## .L13 is the rest of the outer loop.
                    ##  It sets up for the next entry to this inner loop.
.L8:
    mov     eax, ebp        # multiply a copy of the counters
    mov     edx, ebx
    imul    eax, ebp        # b*b
    imul    edx, ebx        # c*c
    add     eax, r15d       # a*a + b*b
    cmp     edx, eax  # tmp137, tmp139
    jne     .L6
 ## Fall-through into the cout print code when we find a match
 ## extremely rare, so should predict near-perfectly

On Intel Haswell, all these instructions are 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, which can issue at one iteration per 2.5 cycles.

Haswell runs imul r32, r32 with a throughput of one iteration per clock, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5c. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.

We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.

Loop-unrolling could help here to reduce the number of uops per check. e.g. use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without using a MOV to make it non-destructive.

A strength-reduction is also possible: Given b*b we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1, so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There are no addressing-modes that subtract instead of add. Hmm, maybe we could keep -b in a register, and count up towards zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX).

Unless you actually bottleneck on IMUL throughput, this might end up taking more uops and not be a win. It might be fun to see how well a compiler would do with this strength-reduction in the C++ source.

You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check if any of the 8 had a hit and then sort out which one it was in the rare case that there was a match.

See also the x86 tag wiki for more optimization stuff.