How to help gcc vectorize C code

2020-06-16 04:56发布

问题:

I have the following C code. The first part just reads in a matrix of complex numbers from standard in into matrix called M. The interesting part is the second part.

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

int main() {
    int n, m, c, d;
    float re, im;

    scanf("%d %d", &n, &m);
    assert(n==m);
    complex float M[n][n];

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
    scanf("%f%fi", &re, &im);
    M[c][d] = re + im * I;
      }
    }

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
        printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
      }
      printf("\n");
    }
/*
Example:input   
2 3
1+2i 2+3i 74-4i
3+4i 4+5i -7-8i
*/
    /* Part 2. M is now an n by n matrix of complex numbers */
    int s=1, i, j;
    int *f = malloc(n * sizeof *f);
    complex float *delta = malloc(n * sizeof *delta);
    complex float *v = malloc(n * sizeof *v);
    complex float p = 1, prod;

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      f[i] = i;
      delta[i] = 1;
    }
    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }
    free(delta);
    free(f);
    free(v);
    printf("%f + i%f\n", creal(p/pow(2.,(n-1))), cimag(p/pow(2.,(n-1))));
    return 0;
}

I compile with gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm. This explains to me why almost none of the loops are being vectorized.

The most important part for performance is lines 47--50 which are:

for (i = 0; i < n; i++) {
    v[i] -= 2.*delta[j]*M[j][i];
    prod *= v[i];
}

gcc tells me:

permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: Unsupported pattern.
permanent-in-c.c:47:7: note: not vectorized: unsupported use in stmt.
permanent-in-c.c:47:7: note: unexpected pattern.
[...]
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: IMAGPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: REALPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
[...]
permanent-in-c.c:48:26: note: Build SLP failed: unrolling required in basic block SLP
permanent-in-c.c:48:26: note: Failed to SLP the basic block.
permanent-in-c.c:48:26: note: not vectorized: failed to find SLP opportunities in basic block.

How can I fix the problems that are stopping this part from being vectorized?


Curiously this part is vectorized but I am not sure why:

for (j = 0; j <n; j++) {
    v[i] += M[j][i];

The full output of gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm is at https://bpaste.net/show/18ebc3d66a53.

回答1:

Let's examine the code in detail, first. We have

complex float  M[rows][cols];
complex float  v[cols];
float          delta[rows];
complex float  p = 1.0f;
float          s = 1.0f;

While delta[] is of complex float type in OP's code, it only ever contains -1.0f or +1.0f. (Furthermore, the calculations could be simplified if it were -2.0f or +2.0f instead.) For that reason, I defined is as real and not complex.

Similarly, OP defines s as int, but uses it effectively as -1.0f and +1.0f only (in the calculations). That's why I defined it explicitly as float.

I omit the f array, because there is a trivial way of avoiding it entirely.

The first loop of the interesting part of the code,

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      delta[i] = 1;
    }

performs several things. It initializes all elements in the delta[] array to 1; it could (and probably should) be split into a separate loop.

Since the outer loop is increasing in i, p will be the product of the elements in v; it could be split off into a separate loop, too.

Because the inner loop sums all the elements in column i to v[i], the outer and inner loops simply sums each row, as a vector, to vector v.

We can thus rewrite the above, in pseudocode, as

Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v

p = product of complex elements in vector v

delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f

 
Let's next look at the second nested loop:

    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }

It is hard to see unless you examine the values of j as the loop progresses, but the last 4 lines in the body of the outer loop implement the OEIS A007814 integer sequence in j (0,1,0,2,0,1,0,3,0,1,0,2,0,1,0,4,...). The number of iterations in this loop is 2rows-1-1. This part of the sequence is symmetric, and implements a binary tree of height rows-1:

               4
       3               3
   2       2       2       2     (Read horizontally)
 1   1   1   1   1   1   1   1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

It turns out that if we loop over i = 1 .. 2rows-1, then r is the number of zero low bits in i. GCC provides a __builtin_ctz() built-in function, that computes exactly this. (Note that __builtin_ctz(0) yields an undefined value; so don't do that, even if it happens to produce a specific value on your computer.)

The inner loop subtracts the complex values on row j of the matrix, scaled by 2*delta[j], from vector v[]. It also calculates the product of the complex entries in vector v[] (after the subtraction) into variable prod.

After the inner loop, delta[j] is negated, as is scale factor s. The value of variable prod, scaled by s, is added to p.

After the loop, the final (complex) result is p divided by 2rows-1. This is better done using the ldexp() C99 function (separately on the real and complex parts).

We can therefore rewrite the second nested loop, in pseudocode, as

s = 1.0f

For k = 1 .. rows-1, inclusive:
    r = __builtin_ctz(k), i.e. number of least
                          significant bits that
                          are zero in k

    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]

    prod = the product of (complex) elements in vector v[]

    Negate scale factor s (changing its sign)

    Add prod times s to result p

In my experience, it is best to split the real and imaginary parts into separate vectors and matrices. Consider the following definitions:

typedef struct {
    float  *real;
    float  *imag;
    size_t  floats_per_row;  /* Often 'stride' */
    size_t  rows;
    size_t  cols;
} complex_float_matrix;

/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);

/* Copy an array of floats */
void float_copy(float *, const float *, size_t);

/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);

/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);

/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);

/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);

All of the above are easily vectorized, as long as float_malloc() yields sufficiently aligned pointers (and the compiler is told that, via e.g. GCC function attribute __attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR)));), and the floats_per_row member in the matrix is a multiple of the number of floats in a vector.

(I do not know whether GCC can automatically vectorize the above functions, but I do know that they can be vectorized "by hand" using the GCC vector extensions.)

With the above, the entire interesting part of the code, in pseudo-C, becomes

complex float permanent(const complex_float_matrix *m)
{
    float         *v_real, *v_imag;
    float         *scale;   /* OP used 'delta' */
    complex float  result;  /* OP used 'p'     */
    complex float  product; /* OP used 'prod'  */
    float          coeff = 1.0f; /* OP used 's' */
    size_t         i = 1 << (m->rows - 1);
    size_t         r;

    if (!m || m->cols < 1 || m->rows < 1 || !i) {
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale  = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */

    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);

    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);

        scale[r] = -scale[r];

        float_add_scaled(v_real, m->real + r * m->floats_per_row, m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, m->cols);

        product = complex_float_product(v_real, v_imag, m->cols);

        coeff = -coeff;

        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    return result;
}

At this point, I would personally implement the above without vectorization, then test it extensively, until I was sure it works correctly.

Then, I'd examine the GCC assembly output (-S) to see if it can sufficiently vectorize the individual operations (the functions I listed earlier).

Hand-vectorizing the functions using GCC vector extensions is quite straightforward. Declaring a float vector is trivial:

typedef  float  vec2f __attribute__((vector_size (8), aligned (8)));  /* 64 bits; MMX, 3DNow! */
typedef  float  vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef  float  vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef  float  vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */

The individual components in each vector can be addressed using the array notation (v[0] and v[1] for vec2f v;). GCC can do basic operations on entire vectors element-wise; we only really need addition and multiplication here. Horizontal operations (operations that apply between elements in the same vector) should be avoided, and instead elements reordered.

GCC will generate working code for the above vector sizes even on architectures without such vectorization, but the resulting code may be slow. (GCC versions up to 5.4 at least will generate a lot of unnecessary moves, typically to stack and back.)

The memory allocated for a vector needs to be sufficiently aligned. malloc() does not provide sufficiently aligned memory in all cases; you ought to use posix_memalign() instead. The aligned attribute can be used to increase the alignment GCC uses for the vector type, when allocating one locally or statically. In a matrix, you need to ensure that rows start at a sufficiently aligned boundary; that's the reason I have the floats_per_row variable in the structure.

In cases where the number of elements in a vector (or row) is large, but not a multiple of the number of floats in a vector, you should pad the vector with "inert" values -- values that do not affect the result, like 0.0f for addition and subtraction, and 1.0f for multiplication.

At least on x86 and x86-64, GCC will generate better code for loops using pointers only. For example, this

void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}

yields better code than

void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}

or

void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}

(which tend to produce similar code). Personally, I'd implement it as

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t        val;

        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t        val;

        __builtin_memcpy(&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    }
}

and float_copy() as

void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        uint64_t       *src = (uint64_t *)__builtin_assume_aligned((void *)source, 8);

        while (ptr < end)
            *(ptr++) = *(src++);

    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t       *src = (uint32_t *)__builtin_assume_aligned((void *)source, 4);

        while (ptr < end)
            *(ptr++) = *(src++);
    }
}

or something close to that.

The hardest to vectorize is complex_float_product(). If you fill in the unused elements in the final vector with 1.0f for the real part and 0.0f for the imaginary part, you can easily compute the complex product for each vector. Remember that

      (a + b i) × (c + d i) = (a c - b d) + (a d + b c) i

The hard part here is to efficiently calculate the complex product for the elements in the vector. Fortunately, that part is not at all critical for overall performance (except for very short vectors, or matrices with very few columns), so it should not matter much in practice.

(In short, the "hard" part is to find the way to reorder the elements to maximally use the packed vector multiplication, and not need so many shuffles/moves to slow it down.)

For the float_add_scaled() function, you should create a vector filled with the scale factor; something like the following,

void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f  coeff = { scale, scale, scale, scale };
    vec4f       *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (vec4f *)__builtin_assume_aligned((void *)source, 16);

    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}

if we ignore the alignment and size checks, and the fallback implementation.



回答2:

I think I might have figured it out. After a lot of trial/error, it became clear that gcc built in vectorization optimizations are sort of hard coded and it doesn't 'understand' complex numbers properly. I made some changes in the code and got your inner performance sensitive loop to vectorize, confirmed by gcc output (though I am not sure the desired outcome is computationally equivalent to what you want). While my understanding is limited to what you want the code to do, the finding is that it'll work fine if you compute real and imag separately. Have a look:

    float t_r = 0.0, t_im = 0.0; // two new temporaries  
    while (j < n-1) {
        prod = 1.;
        for (i = 0; i < n; i++) {
// fill the temps after subtraction from V to avoid stmt error
            t_r = creal (v[i]) - (2. * creal(delta[j]) * creal (M[j][i]));
            t_im = cimag(v[i]) - (2. * cimag(delta[j]) * cimag (M[j][i])) * I;
            //v[i] = 2.*delta[j]*M[j][i];
            v[i] = t_r + t_im; // sum of real and img
            prod *= v[i];
        }
        delta[j] = -delta[j];
        s = -s;            
        p += s*prod;
        f[0] = 0;
        f[j] = f[j+1];
        f[j+1] = j+1;
        j = f[0];
    }


回答3:

Optimizer logs indicate clearly

Unknown alignment for access:...

when trying to vectorize

printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d])); //24
v[i] += M[j][i]; //38
p *= v[i]; //40
v[i] -= 2.*delta[j]*M[j][i]; //48

It seems really linked you need to force alignment of your arrays M, delta and v in memory.

Auto-vectorization in GCC

Handling of aligned memory accesses only (do not attempt to vectorize loops that contain unaligned accesses)

As mentionned in previous comments, I would suggest you use posix_memalign for that purpose.

complex float * restrict delta;
posix_memalign(&delta, 64, n * sizeof *delta); //to adapt

What is your target environment? (OS, CPU)

Please have a look at data-alignment-to-assist-vectorization