Demonstrator code failing to show 4 times faster SIMD speed

Posted 2020-04-17 03:57

Question:

I am trying to understand the benefit of SIMD vectorization and wrote a simple demonstrator to see what the speed gain of an algorithm leveraging vectorization (SIMD) would be over one without it. Here are the two algorithms:

Alg_A - No Vector support:

#include <stdio.h>

#define SIZE 1000000000

int main() {
    printf("Algorithm with NO vector support\n");

    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}

Alg_B - With Vector support:

#include <stdio.h>

#define SIZE 1000000000

typedef int v4_i __attribute__ ((vector_size(16)));
union Vec4 {
    v4_i v;
    int i[4];
};

int main() {
    printf("Algorithm with vector support\n\n");

    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
}

The compilation was done as follows:

Alg_A :

gcc -ggdb -mno-sse -mno-sse2 -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -c non_vector_support.c
gcc non_vector_support.o -o non_vector_support

Alg_B :

gcc -ggdb -c vector_support.c
gcc vector_support.o -o vector_support

Looking at the generated code for both algorithms, I can see that the compiler did not do any tricks like auto-vectorization (e.g. no xmm registers appear in Alg_A):

Alg_A :

    for (; i < SIZE; i++) {
  74:   eb 30                   jmp    a6 <main+0xa6>
        a[0] = a[0] + b[0];
  76:   8b 55 d8                mov    -0x28(%rbp),%edx
  79:   8b 45 e8                mov    -0x18(%rbp),%eax
  7c:   01 d0                   add    %edx,%eax
  7e:   89 45 d8                mov    %eax,-0x28(%rbp)
        a[1] = a[1] + b[1];
  81:   8b 55 dc                mov    -0x24(%rbp),%edx
  84:   8b 45 ec                mov    -0x14(%rbp),%eax
  87:   01 d0                   add    %edx,%eax
  89:   89 45 dc                mov    %eax,-0x24(%rbp)
        a[2] = a[2] + b[2];
  8c:   8b 55 e0                mov    -0x20(%rbp),%edx
  8f:   8b 45 f0                mov    -0x10(%rbp),%eax
  92:   01 d0                   add    %edx,%eax
  94:   89 45 e0                mov    %eax,-0x20(%rbp)
        a[3] = a[3] + b[3];
  97:   8b 55 e4                mov    -0x1c(%rbp),%edx
  9a:   8b 45 f4                mov    -0xc(%rbp),%eax
  9d:   01 d0                   add    %edx,%eax
  9f:   89 45 e4                mov    %eax,-0x1c(%rbp)
    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  a2:   83 45 d4 01             addl   $0x1,-0x2c(%rbp)
  a6:   81 7d d4 ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x2c(%rbp)
  ad:   7e c7                   jle    76 <main+0x76>
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);

Alg_B :

    for (; i < SIZE; i++) {
  74:   eb 16                   jmp    8c <main+0x8c>
        a.v = a.v + b.v;
  76:   66 0f 6f 4d d0          movdqa -0x30(%rbp),%xmm1
  7b:   66 0f 6f 45 e0          movdqa -0x20(%rbp),%xmm0
  80:   66 0f fe c1             paddd  %xmm1,%xmm0
  84:   0f 29 45 d0             movaps %xmm0,-0x30(%rbp)
    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  88:   83 45 cc 01             addl   $0x1,-0x34(%rbp)
  8c:   81 7d cc ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x34(%rbp)
  93:   7e e1                   jle    76 <main+0x76>
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);

When I run the programs, I was hoping to see a speed improvement by a factor of 4; however, the gain appears to be about 1 s on average for this data size, and if I increase SIZE to around 8000000000 the gain is about 5 s. These are the timings for the SIZE value in the code above:

Alg_A :

Algorithm with NO vector support
Running loop 1000000000 times
A: [705032705 1705032706 -1589934589 -589934588]

real    0m3.279s
user    0m3.282s
sys     0m0.000s

Alg_B :

Algorithm with vector support

Running loop 1000000000 times
A: [705032705 705032706 -1589934589 -589934588]

real    0m2.609s
user    0m2.607s
sys     0m0.004s

To account for the overhead of the loop itself, I ran an empty loop for the given SIZE and measured about 2.2 s on average. Subtracting that overhead, the non-vector version spends roughly 1.1 s on the adds versus roughly 0.4 s for the vector version, which gives me an average speed-up of around 2.5 times.

Have I missed something in the code or the compilation commands? Or is this expected, and if so, does someone know why there isn't a 4x gain in speed when I am processing 4 data points with 1 instruction at each iteration?

Thanks in advance.

Answer 1:

That must be instruction latency (a RAW, read-after-write, dependency). While scalar ALU instructions have little to no latency, i.e. the result can be an operand of the next instruction without any delay, SIMD instructions tend to have longer latencies before the result is available, even for ones as simple as add.

Extend the arrays to 16 or even 32 entries, spanning 4 or 8 SIMD vectors, and you will see huge differences thanks to instruction scheduling:

Now:               add v, (wait for latency), add v, (wait for latency), ...

4-vector rotation: add v1, add v2, add v3, add v4, add v1, add v2, ...
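
As a minimal sketch of that idea (my own illustration, reusing the GCC vector extension and union from the question; not code from this answer), four independent vector accumulators let four add chains be in flight at once:

#include <stdio.h>

#define SIZE 1000000000

typedef int v4_i __attribute__ ((vector_size(16)));
union Vec4 {
    v4_i v;
    int i[4];
};

int main() {
    /* Four independent accumulators: each add depends only on that
       accumulator's previous value, so the four chains can overlap in
       the pipeline instead of serializing on a single register. */
    union Vec4 a1 = {{1, 2, 3, 4}};
    union Vec4 a2 = {{0, 0, 0, 0}}, a3 = {{0, 0, 0, 0}}, a4 = {{0, 0, 0, 0}};
    union Vec4 b = {{5, 6, 7, 8}};
    long i;

    for (i = 0; i < SIZE; i += 4) {   /* same total number of adds as before */
        a1.v = a1.v + b.v;
        a2.v = a2.v + b.v;
        a3.v = a3.v + b.v;
        a4.v = a4.v + b.v;
    }

    union Vec4 a;
    a.v = a1.v + a2.v + a3.v + a4.v;  /* combine the partial sums at the end */
    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
}

(Note that, as answer 3 below points out, this only shows its effect when compiled with optimization enabled, so the accumulators stay in registers.)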

Google "instruction scheduling" and "RAW dependency" for more detailed information.



Answer 2:

I put together some sample code below to illustrate how you might see the benefits of SIMD versus scalar code. The example code is a little contrived, but the main point to note is that there need to be sufficient arithmetic operations in the loop to mitigate load/store latency and loop overheads - a single add operation, as in your initial experiment, is not sufficient.

This example achieves around a 4x throughput improvement for 32-bit int data. There are two versions of the SIMD loop: a simple loop with no unrolling, and an alternative loop with 2x unrolling. As might be expected, the unrolled loop is a little faster.

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>   // gettimeofday
#include <smmintrin.h>  // SSE 4.1

static void foo_scalar(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        a[i] = (b[i] + c[i] + 1) / 2;
    }
}

static void foo_simd(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    size_t i;

#ifndef UNROLL
    for (i = 0; i <= n - 4; i += 4)
    {
        __m128i vb = _mm_loadu_si128((__m128i *)&b[i]);
        __m128i vc = _mm_loadu_si128((__m128i *)&c[i]);
        __m128i v = _mm_add_epi32(vb, vc);
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        v = _mm_srli_epi32(v, 1);
        _mm_storeu_si128((__m128i *)&a[i], v);
    }
#else
    for (i = 0; i <= n - 8; i += 8)
    {
        __m128i vb0 = _mm_loadu_si128((__m128i *)&b[i]);
        __m128i vb1 = _mm_loadu_si128((__m128i *)&b[i + 4]);
        __m128i vc0 = _mm_loadu_si128((__m128i *)&c[i]);
        __m128i vc1 = _mm_loadu_si128((__m128i *)&c[i + 4]);
        __m128i v0 = _mm_add_epi32(vb0, vc0);
        __m128i v1 = _mm_add_epi32(vb1, vc1);
        v0 = _mm_add_epi32(v0, _mm_set1_epi32(1));
        v1 = _mm_add_epi32(v1, _mm_set1_epi32(1));
        v0 = _mm_srli_epi32(v0, 1);
        v1 = _mm_srli_epi32(v1, 1);
        _mm_storeu_si128((__m128i *)&a[i], v0);
        _mm_storeu_si128((__m128i *)&a[i + 4], v1);
    }
#endif
    foo_scalar(&a[i], &b[i], &c[i], n - i);
}

int main(int argc, char *argv[])
{
    const size_t kLoops = 100000;
    size_t n = 2 * 1024;
    struct timeval t0, t1;
    double t_scalar_ms, t_simd_ms;

    if (argc > 1)
    {
        n = atoi(argv[1]);
    }

    printf("kLoops = %zu, n = %zu\n", kLoops, n);

    uint32_t * a_scalar = malloc(n * sizeof(uint32_t));
    uint32_t * a_simd = malloc(n * sizeof(uint32_t));
    uint32_t * b = malloc(n * sizeof(uint32_t));
    uint32_t * c = malloc(n * sizeof(uint32_t));

    for (size_t i = 0; i < n; ++i)
    {
        a_scalar[i] = a_simd[i] = 0;
        b[i] = rand();
        c[i] = rand();
    }

    gettimeofday(&t0, NULL);
    for (size_t k = 0; k < kLoops; ++k)
    {
        foo_scalar(a_scalar, b, c, n);
    }
    gettimeofday(&t1, NULL);
    t_scalar_ms = ((double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6) * 1.0e3;

    gettimeofday(&t0, NULL);
    for (size_t k = 0; k < kLoops; ++k)
    {
        foo_simd(a_simd, b, c, n);
    }
    gettimeofday(&t1, NULL);
    t_simd_ms = ((double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6) * 1.0e3;

    int64_t sum_scalar = 0, sum_simd = 0;
    for (size_t i = 0; i < n; ++i)
    {
        sum_scalar += a_scalar[i];
        sum_simd += a_simd[i];
    }
    assert(sum_scalar == sum_simd);

    printf("t_scalar = %8g ms = %8g ns / point\n", t_scalar_ms, t_scalar_ms / (kLoops * n) * 1e6);
    printf("t_simd   = %8g ms = %8g ns / point\n", t_simd_ms, t_simd_ms / (kLoops * n) * 1e6);
    printf("Speed-up = %2.1fx\n",  t_scalar_ms / t_simd_ms);

    return 0;
}

Compile and run (no SIMD loop unrolling):

$ gcc-4.8 -fno-tree-vectorize -std=gnu99 -Wall gros_lalo.c -O3 -msse4.1 && ./a.out
kLoops = 100000, n = 2048
t_scalar =  122.668 ms = 0.598965 ns / point
t_simd   =   33.785 ms = 0.164966 ns / point
Speed-up = 3.6x

Compile and run (2x SIMD loop unrolling):

$ gcc-4.8 -fno-tree-vectorize -std=gnu99 -Wall gros_lalo.c -O3 -msse4.1 -DUNROLL && ./a.out
kLoops = 100000, n = 2048
t_scalar =  121.897 ms =   0.5952 ns / point
t_simd   =    29.07 ms = 0.141943 ns / point
Speed-up = 4.2x

It is interesting to look at the generated code:

Scalar:

    xorl    %ecx, %ecx
    .align 4
L10:
    movl    0(%rbp,%rcx,4), %esi
    addl    (%rbx,%rcx,4), %esi
    addl    $1, %esi
    shrl    %esi
    movl    %esi, (%r15,%rcx,4)
    addq    $1, %rcx
    cmpq    %r12, %rcx
    jne L10

SIMD (no unrolling):

    xorl    %ecx, %ecx
    xorl    %eax, %eax
    .align 4
L18:
    movdqu  0(%rbp,%rcx), %xmm2
    addq    $4, %rax
    movdqu  (%rbx,%rcx), %xmm1
    paddd   %xmm2, %xmm1
    paddd   %xmm3, %xmm1
    psrld   $1, %xmm1
    movdqu  %xmm1, (%r14,%rcx)
    addq    $16, %rcx
    cmpq    %r9, %rax
    jbe L18

SIMD (2x unrolling):

    xorl    %edx, %edx
    xorl    %ecx, %ecx
    .align 4
L18:
    movdqu  0(%rbp,%rdx), %xmm5
    addq    $8, %rcx
    movdqu  (%r11,%rdx), %xmm4
    movdqu  (%rbx,%rdx), %xmm2
    movdqu  (%r10,%rdx), %xmm1
    paddd   %xmm5, %xmm2
    paddd   %xmm4, %xmm1
    paddd   %xmm3, %xmm2
    paddd   %xmm3, %xmm1
    psrld   $1, %xmm2
    psrld   $1, %xmm1
    movdqu  %xmm2, 0(%r13,%rdx)
    movdqu  %xmm1, (%rax,%rdx)
    addq    $32, %rdx
    cmpq    %r15, %rcx
    jbe L18

Note that there is a similar number of instructions in the first two loops, but the SIMD loop is of course processing four elements per iteration, whereas the scalar loop processes only one element per iteration. In the third, unrolled loop we have more instructions, but we are processing eight elements per iteration - note that the proportion of loop-housekeeping instructions has been reduced relative to the SIMD loop without unrolling.
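
To make that concrete (my own tally of the listings above, not part of the original answer): the scalar loop body is 8 instructions for 1 element, the plain SIMD loop is 10 instructions for 4 elements (2.5 per element), and the unrolled loop is 16 instructions for 8 elements (2 per element), which is roughly in line with the measured 3.6x and 4.2x speed-ups.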

Timing data was collected on a 2.6 GHz Haswell Core i7 CPU with gcc 4.8 on Mac OS X 10.10. Performance results should be similar on any reasonably current x86 CPU, however.



Answer 3:

The biggest problem here is that you benchmarked with optimization disabled. GCC's default is -O0, a debug mode that keeps all variables in memory between C statements. That is generally useless for benchmarking and massively distorts your results by introducing a store/reload into the dependency chain from the output of one iteration to the input of the next.
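
One way to see the difference with optimization enabled is to keep the vector in a register and stop GCC from collapsing the loop. This is only a minimal sketch under assumptions not in the original answer - it relies on GCC's vector extensions and the "x" (XMM register) inline-asm constraint:

#include <stdio.h>

#define SIZE 1000000000L

typedef int v4_i __attribute__ ((vector_size(16)));

int main() {
    v4_i a = {1, 2, 3, 4};
    v4_i b = {5, 6, 7, 8};

    for (long i = 0; i < SIZE; i++) {
        a += b;
        /* Empty asm that claims to modify 'a' in an xmm register: the
           optimizer must keep one real paddd per iteration, but 'a'
           never has to go through memory. */
        asm volatile("" : "+x"(a));
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}

Built with something like gcc -O2, this runs the loop at roughly one paddd per clock instead of one per store-forwarding round trip.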


Using vector operations exploits SIMD parallelism in your program. But it does not speed up the sequential parts of your program, like the time it takes to load your program or to print to the screen. This limits the maximum speedup your program can attain. This is Amdahl's law.
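
For reference, the standard statement of Amdahl's law (not spelled out in the answer): if a fraction p of the run time benefits from a speed-up of s, the overall speed-up is bounded by

    overall speedup = 1 / ((1 - p) + p / s)

For example, if only a quarter of the run time were vectorizable work, even an infinite s would cap the whole program at about 1.33x.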

In addition, your x86 processor takes advantage of parallelism even in non-SIMD code. Intel's Haswell processor has four scalar-integer ALUs, so it can do 4 adds per clock if 4 add instructions have their inputs ready in that cycle.

Two of Haswell's execution ports have SIMD-integer execution units that can run paddd. But your loop only has one dependency chain for paddd, vs. four independent ones for add.

Instruction-throughput bottlenecks are also a factor: the front-end can only issue up to 4 uops per clock. All the store/reload mov instructions mean the scalar version may be bumping into that bottleneck: with 2 mov loads + 1 add + 1 mov store per element, the front-end can only supply one element's worth of instructions (including 1 add) per clock cycle. But the store-forwarding bottleneck lengthens the dependency chain from 1 cycle for add on its own to about 5 or 6 cycles for add + store/reload, so those four dependency chains can still overlap.
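
As a rough sanity check (my own arithmetic; the question does not state the clock speed, so assume roughly 3 GHz), the measured times convert to per-iteration costs that fit a latency-bound picture much better than a throughput-bound one:

    Alg_A (scalar): 3.279 s / 1e9 iterations ≈ 3.3 ns ≈ 10 cycles per iteration
    Alg_B (SIMD):   2.609 s / 1e9 iterations ≈ 2.6 ns ≈  8 cycles per iteration

Both numbers are in the ballpark of a 5-6 cycle store/reload round trip plus loop overhead, not of a 1-cycle add or paddd, which is why the ratio is nowhere near 4x.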


So you are comparing execution times not of a sequential execution versus a parallel one, but of two parallel executions: one using scalar ILP and one using SIMD.

Anti-optimized debug-mode code is a huge bottleneck for your SIMD version, too - really a bigger one, because there is less other work to fill the gap created by the latency. SIMD store/reload also has about one cycle higher latency than scalar integer store/reload.

See https://stackoverflow.com/tags/x86/info and https://agner.org/optimize/ for more details, and David Kanter's Haswell microarchitecture deep dive for block diagrams of the CPU along with explanations.



Tags: c gcc x86 sse simd