-O2 in ICC messes up assembler, fine with -O1 in ICC and all optimizations in GCC / Clang

Posted 2019-08-06 00:36

Question:

I recently started using ICC (18.0.1.126) to compile code that works fine with GCC and Clang at arbitrary optimization settings. The code contains an assembler routine that multiplies 4x4 matrices of doubles using AVX2 and FMA instructions. After much fiddling it turned out that the assembler routine works properly when compiled with -O1 -xCORE-AVX2, but gives wrong numerical results when compiled with -O2 -xCORE-AVX2. The code compiles without any error messages at all optimization settings, however. It runs on an early 2015 MacBook Air with a Broadwell Core i5.

I also have several versions of the 4x4 matrix multiplication routine, originally written for speed testing: with and without FMA, and using assembler or intrinsics. All of them show the same problem.

I pass the routine a pointer to the first element of a 4x4 array of doubles, which was created as double MatrixDummy[4][4]; and passed to the routine as (&MatrixDummy)[0][0].
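For reference, a minimal sketch of that calling pattern (MatrixA and MatrixB are illustrative names, not from my actual code; (&MatrixDummy)[0][0] and &MatrixDummy[0][0] designate the same first element):

double MatrixA[4][4], MatrixB[4][4], MatrixDummy[4][4];

/* ... fill MatrixA and MatrixB ... */

/* each argument is a pointer to the first element of a 4x4 array */
RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(
    &MatrixA[0][0], &MatrixB[0][0], &MatrixDummy[0][0]);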

The assembler routine is here:

//Routine multiplies the 4x4 matrices A * B and stores the result in C
inline void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
__asm__ __volatile__  ("vmovupd %0, %%ymm0 \n\t"
                       "vmovupd %1, %%ymm1 \n\t"
                       "vmovupd %2, %%ymm2 \n\t"
                       "vmovupd %3, %%ymm3"
                       :
                       :
                       "m" (B[0]),
                       "m" (B[4]),
                       "m" (B[8]),
                       "m" (B[12])
                       :
                       "ymm0", "ymm1", "ymm2", "ymm3");


__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[0])
                      :
                      "m" (A[0]),
                      "m" (A[1]),
                      "m" (A[2]),
                      "m" (A[3])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[4])
                      :
                      "m" (A[4]),
                      "m" (A[5]),
                      "m" (A[6]),
                      "m" (A[7])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[8])
                      :
                      "m" (A[8]),
                      "m" (A[9]),
                      "m" (A[10]),
                      "m" (A[11])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");


__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[12])
                      :
                      "m" (A[12]),
                      "m" (A[13]),
                      "m" (A[14]),
                      "m" (A[15])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

}

As a comparison, the following routine is supposed to do exactly the same thing, and does so under all compilers and optimization settings. Since everything works if I use this routine instead of the assembler one, I would expect the error to lie in how ICC handles the assembler routine at -O2.

inline void Run3ForLoops_MultiplyMatrixByMatrix_OutputTo3(double *A, double *B, double *C){

    int i, j, k;

    double dummy[4][4];

    /* accumulate A * B into a local buffer */
    for(j=0; j<4; j++) {
        for(k=0; k<4; k++) {
            dummy[j][k] = 0.0;
            for(i=0; i<4; i++) {
                dummy[j][k] += *(A+j*4+i)*(*(B+i*4+k));
            }
        }
    }

    /* copy the result to C */
    for(j=0; j<4; j++) {
        for(k=0; k<4; k++) {
            *(C+j*4+k) = dummy[j][k];
        }
    }
}

Any ideas? I am really confused.

Answer 1:

The core problem with your code is the assumption that if you write a value to a register, that value is still going to be there in the next statement. This assumption is wrong. Between asm statements, the compiler may use any register as it likes. For example, it may decide to use ymm0 to copy a variable from one location to another between your statements, trashing its previous content.
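As a purely hypothetical illustration (not actual ICC output), the danger looks like this:

/* first statement leaves a value in ymm0 ... */
__asm__ __volatile__ ("vmovupd %0, %%ymm0" : : "m" (B[0]) : "ymm0");

/* ... but the compiler may emit its own code here, e.g. an inlined
   copy that happens to use ymm0, destroying the value ... */

/* ... so this statement can read garbage from ymm0 */
__asm__ __volatile__ ("vmulpd %%ymm4, %%ymm0, %%ymm8" : : : "ymm8");

Nothing ties the two statements together, so the compiler has no idea the second one depends on a register set by the first.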

The correct way to write inline assembly is to never refer to registers directly unless there is a good reason to do so. Each value you want to keep between assembly statements needs to be tied to a variable through an appropriate operand. The manual is quite clear about this.

As an example, let me rewrite your code to use correct inline assembly:

#include <immintrin.h>

inline void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
    size_t i;

    /* the registers you use */
    __m256d a0, a1, a2, a3, b0, b1, b2, b3, sum;
    __m256d *B256 = (__m256d *)B, *C256 = (__m256d *)C;

    /* load values from B */
    asm ("vmovupd %1, %0" : "=x"(b0) : "m"(B256[0]));
    asm ("vmovupd %1, %0" : "=x"(b1) : "m"(B256[1]));
    asm ("vmovupd %1, %0" : "=x"(b2) : "m"(B256[2]));
    asm ("vmovupd %1, %0" : "=x"(b3) : "m"(B256[3]));

    for (i = 0; i < 4; i++) {
        /* load values from A */
        asm ("vbroadcastsd %1, %0" : "=x"(a0) : "m"(A[4 * i + 0]));
        asm ("vbroadcastsd %1, %0" : "=x"(a1) : "m"(A[4 * i + 1]));
        asm ("vbroadcastsd %1, %0" : "=x"(a2) : "m"(A[4 * i + 2]));
        asm ("vbroadcastsd %1, %0" : "=x"(a3) : "m"(A[4 * i + 3]));

        asm ("vmulpd %2, %1, %0"      : "=x"(sum) : "x"(a0), "x"(b0));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a1), "x"(b1));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a2), "x"(b2));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a3), "x"(b3));
        asm ("vmovupd %1, %0" : "=m"(C256[i]) : "x"(sum));
    }
}

There are a bunch of things you should immediately notice:

  • every register we use is described abstractly through an asm operand
  • all values we save are tied to local variables so the compiler can keep track of which registers are in use and which registers can be clobbered
  • since all dependencies and side effects of the asm statements are explicitly described through operands, no volatile qualifier is necessary and the compiler can optimize the code much better

Still, you should really consider using intrinsics instead, as compilers can do many more optimizations with intrinsics than with inline assembly. This is because the compiler understands, to some extent, what an intrinsic does and can use that knowledge to generate better code.
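For instance, a minimal intrinsics sketch of the same 4x4 multiply could look like this (the function name is illustrative; it assumes a compiler with AVX and FMA support enabled, e.g. -xCORE-AVX2 on ICC or -mfma on GCC/Clang):

#include <stddef.h>
#include <immintrin.h>

/* hypothetical intrinsics version of the routine above */
static inline void MultiplyMatrixByMatrix_Intrinsics(const double *A, const double *B, double *C)
{
    size_t i;

    /* load the four rows of B once */
    __m256d b0 = _mm256_loadu_pd(B + 0);
    __m256d b1 = _mm256_loadu_pd(B + 4);
    __m256d b2 = _mm256_loadu_pd(B + 8);
    __m256d b3 = _mm256_loadu_pd(B + 12);

    for (i = 0; i < 4; i++) {
        /* row i of C is a linear combination of the rows of B,
           weighted by the entries of row i of A */
        __m256d sum = _mm256_mul_pd(_mm256_broadcast_sd(&A[4 * i + 0]), b0);
        sum = _mm256_fmadd_pd(_mm256_broadcast_sd(&A[4 * i + 1]), b1, sum);
        sum = _mm256_fmadd_pd(_mm256_broadcast_sd(&A[4 * i + 2]), b2, sum);
        sum = _mm256_fmadd_pd(_mm256_broadcast_sd(&A[4 * i + 3]), b3, sum);
        _mm256_storeu_pd(&C[4 * i], sum);
    }
}

Here the compiler tracks every value itself, so it is free to keep the broadcasts and sums in whatever registers it likes, unroll the loop, or interleave the routine with surrounding code.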