I was recently starting to use ICC (18.0.1.126) to compile a code that worked fine with GCC and Clang on arbitrary optimization settings. The code contains an assembler routine that multiplies 4x4 matrices of doubles using AVX2 and FMA instructions. After much fiddling it turned out that the assembler routine is working properly when compiled with -O1 - xcore-avx2, but gives a wrong numerical result when compiled with -O2 - xcore-avx2. The code compiles however without any error messages on all optimization settings. It runs on an early 2015 MacBook Air with Broadwell core i5.
I also have several versions of the 4x4 matrix multiplication routine originally written for speed testing, with / without FMA, and using assembler / intrinsics. It is the same problem for all of them.
I pass to the routine a pointer to the first element of a 4x4 array of doubles which was created as double MatrixDummy[4][4]; and passed to the routine as (&MatrixDummy)[0][0]
The assembler routine is here:
//Routine multiplies the 4x4 matrices A * B and store the result in C
inline void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
__asm__ __volatile__ ("vmovupd %0, %%ymm0 \n\t"
"vmovupd %1, %%ymm1 \n\t"
"vmovupd %2, %%ymm2 \n\t"
"vmovupd %3, %%ymm3"
:
:
"m" (B[0]),
"m" (B[4]),
"m" (B[8]),
"m" (B[12])
:
"ymm0", "ymm1", "ymm2", "ymm3");
__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
"vbroadcastsd %2, %%ymm5 \n\t"
"vbroadcastsd %3, %%ymm6 \n\t"
"vbroadcastsd %4, %%ymm7 \n\t"
"vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
"vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
"vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
"vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
"vmovupd %%ymm8, %0"
:
"=m" (C[0])
:
"m" (A[0]),
"m" (A[1]),
"m" (A[2]),
"m" (A[3])
:
"ymm4", "ymm5", "ymm6", "ymm7", "ymm8");
__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
"vbroadcastsd %2, %%ymm5 \n\t"
"vbroadcastsd %3, %%ymm6 \n\t"
"vbroadcastsd %4, %%ymm7 \n\t"
"vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
"vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
"vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
"vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
"vmovupd %%ymm8, %0"
:
"=m" (C[4])
:
"m" (A[4]),
"m" (A[5]),
"m" (A[6]),
"m" (A[7])
:
"ymm4", "ymm5", "ymm6", "ymm7", "ymm8");
__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
"vbroadcastsd %2, %%ymm5 \n\t"
"vbroadcastsd %3, %%ymm6 \n\t"
"vbroadcastsd %4, %%ymm7 \n\t"
"vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
"vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
"vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
"vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
"vmovupd %%ymm8, %0"
:
"=m" (C[8])
:
"m" (A[8]),
"m" (A[9]),
"m" (A[10]),
"m" (A[11])
:
"ymm4", "ymm5", "ymm6", "ymm7", "ymm8");
__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
"vbroadcastsd %2, %%ymm5 \n\t"
"vbroadcastsd %3, %%ymm6 \n\t"
"vbroadcastsd %4, %%ymm7 \n\t"
"vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
"vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
"vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
"vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
"vmovupd %%ymm8, %0"
:
"=m" (C[12])
:
"m" (A[12]),
"m" (A[13]),
"m" (A[14]),
"m" (A[15])
:
"ymm4", "ymm5", "ymm6", "ymm7", "ymm8");
}
As a comparison, the following code is supposed to do the exact same thing, and does so using all compilers / optimization settings. Since everything works if I use this routine instead of the assembler one, I would expect that the error has to be in how ICC handles the assembler routine with -O2 optimization.
inline void Run3ForLoops_MultiplyMatrixByMatrix_OutputTo3(double *A, double *B, double *C){
int i, j, k;
double dummy[4][4];
for(j=0; j<4; j++) {
for(k=0; k<4; k++) {
dummy[j][k] = 0.0;
for(i=0; I<4; i++) {
dummy[j][k] += *(A+j*4+i)*(*(B+i*4+k));
}
}
}
for(j=0; j<4; j++) {
for(k=0; k<4; k++) {
*(C+j*4+k) = dummy[j][k];
}
}
}
Any ideas? I am really confused.