Optimize for fast multiplication but slow addition

2019-01-08 00:51发布

问题:

When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:

intn = 0;
for(int32_t i=0; i<maxiter; i++) {
    floatn x2 = square(x), y2 = square(y); //square(x) = x*x
    floatn r2 = x2 + y2;
    booln mask = r2<cut; //booln is in the float domain non integer domain
    if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
    n -= mask
    floatn t = x*y; mul2(t); //mul2(t): t*=2
    x = x2 - y2 + cx;
    y = t + cy;
}

This determines if n pixels are in the Mandelbrot set. So for double floating point it runs over 4 pixels (floatn = __m256d, intn = __m256i). This requires 4 SIMD floating point multiplication and four SIMD floating point additions.

Then I modified this to work with FMA like this

intn n = 0; 
for(int32_t i=0; i<maxiter; i++) {
    floatn r2 = mul_add(x,x,y*y);
    booln mask = r2<cut;
    if(!horizontal_or(mask)) break;
    add_mask(n,mask);
    floatn t = x*y;
    x = mul_sub(x,x, mul_sub(y,y,cx));
    y = mul_add(2.0f,t,cy);
}

where mul_add calls _mm256_fmad_pd and mul_sub calls _mm256_fmsub_pd. This method uses 4 FMA SIMD operations, and two SIMD multiplications which is two less arithmetic operations then without FMA. Additionally, FMA and multiplication can use two ports and addition only one.

To make my tests less biased I zoomed into a region which is entirely in the Mandelbrot set so all the values are maxiter. In this case the method using FMA is about 27% faster. That's certainly an improvement but going from SSE to AVX doubled my performance so I was hoping for maybe another factor of two with FMA.

But then I found this answer in regards to FMA where it says

The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.

and later gives an example of double*double to double-double multiplication

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */

From this, I concluded that I was implementing FMA non-optimally and so I decided to implement SIMD double-double. I implemented double-double based on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float so I modified it for double-double. Additionally, instead of packing one double-double value in a SIMD registers I pack 4 double-double values into one AVX high register and one AVX low register.

For the Mandelbrot set what I really need is double-double multiplication and addition. In that paper these are the df64_add and df64_mult functions. The image below shows the assembly for my df64_mult function for software FMA (left) and hardware FMA (right). This clearly shows that hardware FMA is a big improvement for double-double multiplication.

So how does hardware FMA perform in the double-double Mandelbrot set calculation? The answer is that's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs 4 double-double additions, and four double-double multiplications (x*x, y*y, x*y, and 2*(x*y)). However, the 2*(x*y) multiplication is trivial for double-double so this multiplication can be ignored in the cost. Therefore, the reason I think the improvement using hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).

It used to be that multiplication was slower than addition (and programers used several tricks to avoid multiplication) but with Haswell it seems that it's the other way around. Not only due to FMA but also because multiplication can use two ports but addition only one.

So my questions (finally) are:

  1. How does one optimize when addition is slow compared to multiplication?
  2. Is there an algebraic way to change my algorithm to use more multiplications and less additions? I know there are method to do the reverse, e.g. (x+y)*(x+y) - (x*x+y*y) = 2*x*y which use two more additions for one less multiplication.
  3. Is there a way to simply the df64_add function (e.g. using FMA)?

In case anyone is wondering the double-double method is about ten times slower than double. That's not so bad I think as if there was a hardware quad-precision type it would likely be at least twice as slow as double so my software method is about five times slower than what I would expect for hardware if it existed.

df64_add assembly

vmovapd 8(%rsp), %ymm0
movq    %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd  %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm2
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6
vsubpd  %ymm6, %ymm5, %ymm5
vsubpd  %ymm6, %ymm1, %ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3
vaddpd  %ymm1, %ymm4, %ymm2
vaddpd  %ymm5, %ymm3, %ymm3
vsubpd  %ymm4, %ymm2, %ymm4
vsubpd  %ymm4, %ymm1, %ymm1
vaddpd  %ymm3, %ymm1, %ymm0
vaddpd  %ymm0, %ymm2, %ymm1
vsubpd  %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd  %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret

回答1:

To answer my third question I found a faster solution for double-double addition. I found an alternative definition in the paper Implementation of float-float operators on graphics hardware.

Theorem 5 (Add22 theorem) Let be ah+al and bh+bl the float-float arguments of the following
algorithm:
Add22 (ah ,al ,bh ,bl)
1 r = ah ⊕ bh
2 if | ah | ≥ | bh | then
3     s = ((( ah ⊖ r ) ⊕ bh ) ⊕ b l ) ⊕ a l
4 e l s e
5     s = ((( bh ⊖ r ) ⊕ ah ) ⊕ a l ) ⊕ b l
6 ( rh , r l ) = add12 ( r , s )
7 return (rh , r l)

Here is how I implemented this (pseudo-code):

static inline doubledoublen add22(doubledoublen const &a, doubledouble const &b) {
    doublen aa,ab,ah,bh,al,bl;
    booln mask;
    aa = abs(a.hi);                //_mm256_and_pd
    ab = abs(b.hi); 
    mask = aa >= ab;               //_mm256_cmple_pd
    // z = select(cut,x,y) is a SIMD version of z = cut ? x : y;
    ah = select(mask,a.hi,b.hi);   //_mm256_blendv_pd
    bh = select(mask,b.hi,a.hi);
    al = select(mask,a.lo,b.lo);
    bl = select(mask,b.lo,a.lo);

    doublen r, s;
    r = ah + bh;
    s = (((ah - r) + bh) + bl ) + al;
    return two_sum(r,s);
}

This definition of Add22 uses 11 additions instead of 20 but it requires some additional code to determine if |ah| >= |bh|. Here is a discussion on how to implement SIMD minmag and maxmag functions. Fortunately, most of the additional code does not use port 1. Now only 12 instructions go to port 1 instead of 20.

Here is a throughput analysis form IACA for the new Add22

Throughput Analysis Report
--------------------------
Block Throughput: 12.05 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 12.0 | 2.5    2.5  | 2.5    2.5  | 2.0  | 10.0 | 0.0  | 2.0  |
---------------------------------------------------------------------------------------


| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rip]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm4, ymmword ptr [rsi]
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm2, ymm4, ymm3
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm3, ymm0, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vcmppd ymm2, ymm3, ymm2, 0x2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm1, ymm0, ymm4, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm4, ymm4, ymm0, ymm2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm5, ymm0, ymm3, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm0, ymm3, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm1, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm3, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm3, ymm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

and here is the throughput analysis from the old

Throughput Analysis Report
--------------------------
Block Throughput: 20.00 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 20.0 | 2.0    2.0  | 2.0    2.0  | 2.0  | 0.0  | 0.0  | 2.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rsi]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm1, ymmword ptr [rdx]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm4, ymm0, ymm1
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm5, ymmword ptr [rdx+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm5, ymm5, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm4, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm4, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

A better solution would be if there were three operand single rounding mode instructions besides FMA. It seems to me there should be single rounding mode instructions for

a + b + c
a * b + c //FMA - this is the only one in x86 so far
a * b * c


回答2:

To speed up the algorithm, I use a simplified version based on 2 fma, 1 mul and 2 add. I process 8 iterations that way. Then compute the escape radius and rollback the last 8 iterations if necessary.

The following critical loop X = X^2 + C written with x86 intrinsics is nicely unrolled by the compiler, and you will spot after unrolling that the 2 FMA operations not badly dependent of each other.

//  IACA_START;
for (j = 0; j < 8; j++) {
    Xrm = _mm256_mul_ps(Xre, Xim);
    Xtt = _mm256_fmsub_ps(Xim, Xim, Cre);
    Xrm = _mm256_add_ps(Xrm, Xrm);
    Xim = _mm256_add_ps(Cim, Xrm);
    Xre = _mm256_fmsub_ps(Xre, Xre, Xtt);
}       // for
//  IACA_END;

And then I compute the escape radius (|X| < threshold), which costs an other fma and another multiplication, only every 8 iterations.

cmp = _mm256_mul_ps(Xre, Xre);
cmp = _mm256_fmadd_ps(Xim, Xim, cmp);
cmp = _mm256_cmp_ps(cmp, vec_threshold, _CMP_LE_OS);
if (_mm256_testc_si256((__m256i) cmp, vec_one)) {
    i += 8;
    continue;
}

You mention "addition is slow", this is not exactly true, but you are right, multiplication throughput get higher and higher over time on recent architectures.

Multiplication latencies and dependencies are the key. FMA has a throughput of 1 cycle and a latency of 5 cycles. the execution of independent FMA instructions can overlap.

Additions based on the result of a multiplication get the full latency hit.

So you have to break these immediate dependencies by doing "code stitching" and compute 2 points in the same loop, and just interleave the code before checking with IACA what will be going on. The following code has 2 sets of variables (suffixed by 0 and 1 for X0=X0^2+C0, X1=X1^2+C1) and starts to fill the FMA holes

for (j = 0; j < 8; j++) {
    Xrm0 = _mm256_mul_ps(Xre0, Xim0);
    Xrm1 = _mm256_mul_ps(Xre1, Xim1);
    Xtt0 = _mm256_fmsub_ps(Xim0, Xim0, Cre);
    Xtt1 = _mm256_fmsub_ps(Xim1, Xim1, Cre);
    Xrm0 = _mm256_add_ps(Xrm0, Xrm0);
    Xrm1 = _mm256_add_ps(Xrm1, Xrm1);
    Xim0 = _mm256_add_ps(Cim0, Xrm0);
    Xim1 = _mm256_add_ps(Cim1, Xrm1);
    Xre0 = _mm256_fmsub_ps(Xre0, Xre0, Xtt0);
    Xre1 = _mm256_fmsub_ps(Xre1, Xre1, Xtt1);
}       // for

To summarize,

  • you can halve the number of instructions in your critical loop
  • you can add more independent instructions and get advantage of high throughput vs. low latency of the multiplications and fused multiply and add.


回答3:

You mention the following code:

vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1  <-- immediate dependency ymm2
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0  <-- immediate dependency ymm2
vaddpd  %ymm1, %ymm0, %ymm2  <-- immediate dependency ymm0
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6  <-- immediate dependency ymm1
vsubpd  %ymm6, %ymm5, %ymm5  <-- immediate dependency ymm6
vsubpd  %ymm6, %ymm1, %ymm6  <-- dependency ymm1, ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3  <-- dependency ymm6
vaddpd  %ymm1, %ymm4, %ymm2 
vaddpd  %ymm5, %ymm3, %ymm3  <-- dependency ymm3
vsubpd  %ymm4, %ymm2, %ymm4 
vsubpd  %ymm4, %ymm1, %ymm1  <-- immediate dependency ymm4
vaddpd  %ymm3, %ymm1, %ymm0  <-- immediate dependency ymm1, ymm3
vaddpd  %ymm0, %ymm2, %ymm1  <-- immediate dependency ymm0
vsubpd  %ymm2, %ymm1, %ymm2  <-- immediate dependency ymm1

if you check carefully, these are mostly dependent operations, and a basic rule about latency/throughput efficiency is not met . Most instructions depend on the outcome of the previous one, or 2 instructions before. This sequence contains a critical path of 30 cycles (about 9 or 10 instructions about "3 cycles latency"/"1 cycle throughput").

Your IACA reports "CP" => instruction in the critical path, and the evaluated cost is 20 cycles throughput. You should get the latency report because it is the one that matter if you are interested in execution speed.

To remove the cost of this critical path, you have to interleave about 20 more similar instructions if the compiler cannot do it (e.g. because your double-double code is in a separate library compiled without -flto optimizations and vzeroupper everywhere at function entry and exit, vectorizer only works well with inlined code).

A possibility is to run 2 computations in parallel (see about code stitching in a previous post to improve the pipelining)

If I assume that your double-double code looks like this "standard" implementation

// (r,e) = x + y
#define two_sum(x, y, r, e) 
    do { double t; r = x + y; t = r - x; e = (x - (r - t)) + (y - t); } while (0)
#define two_difference(x, y, r, e) \
    do { double t; r = x - y; t = r - x; e = (x - (r - t)) - (y + t); } while (0)
.....

Then you have to consider the following code, where instructions are interleaved at quite a fine grain.

// (r1, e1) = x1 + y1, (r2, e2) x2 + y2
#define two_sum(x1, y1, x2, y2, r1, e1, r2, e2) 
    do { double t1, t2 \
    r1 = x1 + y1; r2 = x2 + y2; \
    t1 = r1 - x1; t2 = r2 - x2; \
    e1 = (x1 - (r1 - t1)) + (y1 - t1); e2 = (x2 - (r2 - t2)) + (y2 - t2);  \
} while (0)
....

Then this creates code like the following one (about same critical path in a latency report, and about 35 instructions). For details about run-time, Out-Of-Order execution should fly over that without stalling.

vsubsd  %xmm2, %xmm0, %xmm8
vsubsd  %xmm3, %xmm1, %xmm1
vaddsd  %xmm4, %xmm4, %xmm4
vaddsd  %xmm5, %xmm5, %xmm5
vsubsd  %xmm0, %xmm8, %xmm9
vsubsd  %xmm9, %xmm8, %xmm10
vaddsd  %xmm2, %xmm9, %xmm2
vsubsd  %xmm10, %xmm0, %xmm0
vsubsd  %xmm2, %xmm0, %xmm11
vaddsd  %xmm14, %xmm4, %xmm2
vaddsd  %xmm11, %xmm1, %xmm12
vsubsd  %xmm4, %xmm2, %xmm0
vaddsd  %xmm12, %xmm8, %xmm13
vsubsd  %xmm0, %xmm2, %xmm11
vsubsd  %xmm0, %xmm14, %xmm1
vaddsd  %xmm6, %xmm13, %xmm3
vsubsd  %xmm8, %xmm13, %xmm8
vsubsd  %xmm11, %xmm4, %xmm4
vsubsd  %xmm13, %xmm3, %xmm15
vsubsd  %xmm8, %xmm12, %xmm12
vaddsd  %xmm1, %xmm4, %xmm14
vsubsd  %xmm15, %xmm3, %xmm9
vsubsd  %xmm15, %xmm6, %xmm6
vaddsd  %xmm7, %xmm12, %xmm7
vsubsd  %xmm9, %xmm13, %xmm10
vaddsd  16(%rsp), %xmm5, %xmm9
vaddsd  %xmm6, %xmm10, %xmm15
vaddsd  %xmm14, %xmm9, %xmm10
vaddsd  %xmm15, %xmm7, %xmm13
vaddsd  %xmm10, %xmm2, %xmm15
vaddsd  %xmm13, %xmm3, %xmm6
vsubsd  %xmm2, %xmm15, %xmm2
vsubsd  %xmm3, %xmm6, %xmm3
vsubsd  %xmm2, %xmm10, %xmm11
vsubsd  %xmm3, %xmm13, %xmm0

Summary:

  • inline your double-double source code: the compiler and vectorizer cannot optimize across function calls because of ABI constraints, and across memory accesses by fear of aliasing.

  • stitch code to balance throughput and latency and maximize the CPU ports usage (and also maximize instructions per cycle), as long as the compiler does not spill too much registers to memory.

You can track the optimization impacts with perf utility (packages linux-tools-generic and linux-cloud-tools-generic) to get the number of instructions executed and the number of instructions per cycle.