Why is the speedup lower than expected when using AVX2?


Question:

I have vectorized the inner loop of a matrix addition using AVX2 intrinsics, and I have the latency table from here. I expected a speedup of about a factor of 5, because roughly 4 cycles of latency occur over 1024 iterations versus 6 cycles of latency over 128 iterations, but the speedup is only a factor of 3. So the question is: what else is going on here that I don't see? I'm using gcc, coding in C with intrinsics, and the CPU is a Skylake 6700HQ.

Here is the C code and the assembly output of the inner loop.

global data:

int __attribute__(( aligned(32))) a[MAX1][MAX2] ;
int __attribute__(( aligned(32))) b[MAX2][MAX3] ;
int __attribute__(( aligned(32))) c_result[MAX1][MAX3] ;

sequential :

for( i = 0 ; i < MAX1 ; i++)
        for(j = 0 ; j < MAX2 ; j++)
            c_result[i][j] = a[i][j] + b[i][j];

.L16:
    movl    (%r9,%rax), %edx           // latency : 2  , throughput : 0.5   number of execution unit : 4 ALU 
    addl    (%r8,%rax), %edx           // latency : don't know , throughput :    0.5     number of execution unit : 4 ALU 
    movl    %edx, c_result(%rcx,%rax)  // latency : 2 , throughput : 1  number of execution unit : 4 ALU 
    addq    $4, %rax
    cmpq    $4096, %rax
    jne .L16

AVX2:

for( i = 0 ; i < MAX1 ; i++){
    for( j = 0 ; j < MAX2 ; j += 8){
        a0_i = _mm256_add_epi32( _mm256_load_si256((__m256i *)&a[i][j]),
                                 _mm256_load_si256((__m256i *)&b[i][j]));
        _mm256_store_si256((__m256i *)&c_result[i][j], a0_i);
    }
}

.L22:
    vmovdqa (%rcx,%rax), %ymm0           // latency : 3 , throughput : 0.5      number of execution unit : 4 ALU
    vpaddd  (%r8,%rax), %ymm0, %ymm0     // latency : don't know , throughput : 0.5  number of execution unit : 3 VEC-ALU
    vmovdqa %ymm0, c_result(%rdx,%rax)   // latency : 3 , throughput : 1    number of execution unit : 4 ALU
    addq    $32, %rax
    cmpq    $4096, %rax
    jne .L22

Answer 1:

Other than the loop counter, there's no loop-carried dependency chain. So operations from different loop iterations can be in flight at once. This means latency isn't the bottleneck, just throughput (of execution units, and the frontend (up to 4 fused-domain uops per clock)).
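To make that distinction concrete, here is a minimal C sketch (not from the original post; the function names are made up for illustration) contrasting a loop with no loop-carried dependency against one that has one:

    #include <stddef.h>

    /* No value flows from one iteration to the next (only the index), so the
       CPU can overlap iterations and throughput is the limit. */
    void elementwise_add(int *restrict c, const int *restrict a,
                         const int *restrict b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Each add needs the previous add's result, so this chain runs at one add
       latency per iteration instead of one add throughput per iteration. */
    long long reduce_sum(const int *a, size_t n)
    {
        long long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }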

Also, your numbers are totally insane. mov loads don't take 4 ALU execution units! And the load/store latency numbers are wrong / meaningless (see the last section).

# Scalar  (serial is the wrong word.  Both versions are serial, not parallel)
.L16:
    movl    (%r9,%rax), %edx           // fused-domain uops: 1.  Unfused domain: a load port
    addl    (%r8,%rax), %edx           // fused-domain uops: 2   Unfused domain: a load port and any ALU port
    movl    %edx, c_result(%rcx,%rax)  // fused-domain uops: 2   Unfused domain: store-address and store-data ports.  port7 can't handle 2-reg addresses
    addq    $4, %rax                   // fused-domain uops: 1   unfused: any ALU
    cmpq    $4096, %rax                // fused-domain uops: 0 (fused with jcc)
    jne .L16                           // fused-domain uops: 1   unfused: port6 (predicted-taken branch)

Total: 7 fused-domain uops means the loop can issue from the loop buffer at one iteration per 2c. (not per 1.75c). Since we're using a mix of loads, stores, and ALU uops, execution ports aren't a bottleneck, just the fused-domain 4-wide issue width. Two loads per 2c and one store per 2c is only half throughput of the load and store execution units.
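As a back-of-the-envelope check (this just restates the numbers above, nothing new), the per-iteration cost works out like this:

    #include <stdio.h>

    int main(void)
    {
        /* Assumptions restated from the text above: 7 fused-domain uops per
           iteration, 4-wide issue, loop buffer issuing whole iterations. */
        const int uops_per_iter = 7;
        const int issue_width   = 4;

        /* 7 / 4 = 1.75, rounded up to whole cycles per iteration -> 2 */
        const int cycles_per_iter = (uops_per_iter + issue_width - 1) / issue_width;

        printf("front-end bound: %d cycles per iteration\n", cycles_per_iter);
        printf("per 2 cycles: 2 loads, 1 store (half of what the ports could do)\n");
        return 0;
    }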

Note that 2-register addressing modes can't micro-fuse on Intel SnB-family. This isn't a problem for pure loads, because they're 1 uop even without micro-fusion.

The analysis is identical for the vector loop. (vpaddd has a latency of 1c on Skylake, and almost every other CPU. The table doesn't list anything in the latency column for padd with a memory operand because the latency of the load is separate from the latency of the add. It only adds one cycle to the dep chain involving the register src/dest, as long as the load address is known far enough ahead of time.)


Agner Fog's store and load latency numbers are kinda bogus, too. He arbitrarily divides the total load-store round trip latency (with store-forwarding) into a latency number for load and for store. IDK why he didn't list load latency as measured by a pointer-chasing test (e.g. repeated mov (%rsi), %rsi). That shows you that Intel SnB-family CPUs have 4 cycle load-use latency.

I meant to send him a note about that, but haven't gotten around to it.
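For reference, a pointer-chasing load-latency test of the kind mentioned above can be sketched in C roughly like this (the timing method, chain length and iteration count are placeholders, not a calibrated benchmark):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* Small circular chain of pointers: each slot holds the address of the
           next, so every load's address depends on the previous load's result. */
        static void *chain[16];
        for (int i = 0; i < 16; i++)
            chain[i] = &chain[(i + 1) % 16];

        void **p = (void **)chain[0];
        const unsigned long iters = 1ul << 26;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < iters; i++)
            p = (void **)*p;      /* the C equivalent of: mov (%rsi), %rsi */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Multiply ns per load by the core clock in GHz to get cycles per load. */
        printf("%p : ~%.2f ns per dependent load\n", (void *)p, ns / iters);
        return 0;
    }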


You should be seeing an AVX2 speedup of 32/4, i.e. 8x. Your problem size is only 4096B, which is small enough for three arrays of that size to fit in L1 cache. (EDIT: the question was misleading: the loop shown is the inner loop of a nested loop. See the comments: apparently even with 4k arrays (not 4M), OP was still only seeing a 3x speedup (vs. 1.5x with 4M arrays), so there's some kind of bottleneck in the AVX version.)
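For the sizes involved, a quick sanity check (assuming Skylake's 32 KiB of L1D per core) looks like this:

    #include <stdio.h>

    int main(void)
    {
        const unsigned array_bytes = 4096;            /* per array, as in the question */
        const unsigned footprint   = 3 * array_bytes; /* a, b and c_result together    */

        printf("total footprint: %u KiB (vs. 32 KiB of L1D)\n", footprint / 1024);
        printf("bytes per iteration: scalar 4, AVX2 32 -> ideal speedup %d x\n", 32 / 4);
        return 0;
    }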

All 3 arrays are aligned, so the problem isn't a cache-line split in the memory operand that doesn't require alignment (the (%r8) one).

My other theory on that doesn't seem very likely either, but are your array addresses offset from each other by exactly 4096B? From Agner Fog's microarch PDF:

It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes

The example shows a store then load, though, so IDK if that truly explains it. Even if the memory-ordering hardware thinks the load and store might be to the same address, I'm not sure why that would stop the code from sustaining as many memory ops, or why it would affect the AVX2 code worse than the scalar code.

It's worth trying offsetting your arrays from each other by an extra 128B or 256B or something.
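One way to try that (sketched below with made-up sizes) is to drop small padding arrays between the globals and then print the actual spacing, since the toolchain is free to reorder or regroup globals and the padding alone doesn't guarantee anything:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX1 8        /* placeholder sizes, not necessarily the original ones */
    #define MAX2 1024
    #define MAX3 1024

    int  __attribute__((aligned(32))) a[MAX1][MAX2];
    char pad_ab[128];     /* push b off the 4 KiB grid relative to a */
    int  __attribute__((aligned(32))) b[MAX2][MAX3];
    char pad_bc[256];     /* push c_result off the 4 KiB grid relative to b */
    int  __attribute__((aligned(32))) c_result[MAX1][MAX3];

    int main(void)
    {
        /* Check that the spacings are no longer multiples of 4096. */
        printf("b - a        mod 4096 = %lu\n",
               (unsigned long)(((uintptr_t)b - (uintptr_t)a) % 4096));
        printf("c_result - b mod 4096 = %lu\n",
               (unsigned long)(((uintptr_t)c_result - (uintptr_t)b) % 4096));
        return 0;
    }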



Answer 2:

The following limitations restrict the performance of both implementations.

First, other than the loop counter there is no loop-carried dependency chain, so operations from different loop iterations can be performed at once. This means latency isn't the main bottleneck here, although latency is still an important factor in HPC. Since the latencies are roughly equal, the throughput of the execution units matters more for both implementations. IACA reports the throughput bottleneck for the scalar implementation as "Inter-Iteration", meaning there is a dependency between consecutive iterations of the loop, and vectorization helps make the code run faster. Furthermore, vpaddd in the vectorized version can be issued on ports 5 and 1, while add uses ports 1, 5 and 6 when port 0 is busy in the first cycle.

Second, the fused-domain throughput of the front-end may affect performance, but according to the IACA results both implementations need 7 uops per iteration, and the HSW/SKL microarchitectures can issue up to 4 fused-domain uops per clock, so each iteration of the inner loop needs 2 cycles. This limitation hurts the AVX2 implementation more than the scalar one.

Third, the data dependency of the algorithm causes many cache misses. Reducing the size of the matrices so they fit into the L1D (first-level data cache) raises the speedup to about a factor of 5 (I tested many times to get 5, but IDK why a later retest showed a speedup of 7.3).
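For completeness, a minimal timing harness of the kind used to get numbers like these might look as follows (array sizes, repeat count and timer are placeholders; build with something like gcc -O2 -mavx2, and consider -fno-tree-vectorize so the "scalar" version isn't auto-vectorized behind your back):

    #include <stdio.h>
    #include <time.h>
    #include <immintrin.h>

    #define MAX1 2        /* 3 arrays x 8 KiB = 24 KiB: fits in a 32 KiB L1D */
    #define MAX2 1024
    #define REPS 100000

    int __attribute__((aligned(32))) a[MAX1][MAX2], b[MAX1][MAX2], c_result[MAX1][MAX2];

    static double now(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec * 1e-9;
    }

    static void add_scalar(void)
    {
        for (int i = 0; i < MAX1; i++)
            for (int j = 0; j < MAX2; j++)
                c_result[i][j] = a[i][j] + b[i][j];
    }

    static void add_avx2(void)
    {
        for (int i = 0; i < MAX1; i++)
            for (int j = 0; j < MAX2; j += 8) {
                __m256i v = _mm256_add_epi32(_mm256_load_si256((__m256i *)&a[i][j]),
                                             _mm256_load_si256((__m256i *)&b[i][j]));
                _mm256_store_si256((__m256i *)&c_result[i][j], v);
            }
    }

    int main(void)
    {
        double t0 = now();
        for (int r = 0; r < REPS; r++) add_scalar();
        double t1 = now();
        for (int r = 0; r < REPS; r++) add_avx2();
        double t2 = now();

        printf("scalar %.3f s, AVX2 %.3f s, speedup %.2fx (c[0][0]=%d)\n",
               t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1), c_result[0][0]);
        return 0;
    }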