I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.
#include <x86intrin.h>   //for the _mm256_* intrinsics

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
This produces a main loop like this
20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add rax,0x20
33: cmp rax,rcx
36: jne 20
But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the end of the array, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this
.L2:
    vmulps  ymm1, ymm2, [rdi+rax]
    vaddps  ymm0, ymm1, [rsi+rax]
    vmovaps [rdx+rax], ymm0
    add     rax, 32
    jne     .L2
but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch or a way to reorder/change my code so the cmp instruction is not produced with GCC.
I tried the following, but it still produces cmp. All variations I have tried still produce cmp.
void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
Edit:
I'm interested in maximizing instruction-level parallelism (ILP) for these functions for arrays which fit in the L1 cache (actually for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
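To be concrete, by unrolling I mean computing more than one vector per loop iteration. A minimal 2x-unrolled sketch of the intrinsic loop (assuming n is a multiple of 16; the unroll16 results below use my hand-written assembly, not this) looks like:
#include <x86intrin.h>

//Sketch only: the intrinsic loop from above, unrolled 2x.
//Assumes n is a multiple of 16 and the arrays are 32-byte aligned.
void triad_unroll2(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    for(int i=0; i<n; i+=16) {
        _mm256_store_ps(&z[i],   _mm256_add_ps(_mm256_load_ps(&x[i]),
                                 _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        _mm256_store_ps(&z[i+8], _mm256_add_ps(_mm256_load_ps(&x[i+8]),
                                 _mm256_mul_ps(k4, _mm256_load_ps(&y[i+8]))));
    }
}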
Edit:
Here is a table of results for a Core2 (pre-Nehalem), an Ivy Bridge, and a Haswell system. intrinsic is the result of using intrinsics, unroll1 is my assembly code not using cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are the percentage of the peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA).
                  SSE       AVX       FMA
intrinsic        71.3%     90.9%     53.6%
unroll1          97.0%     96.1%     63.5%
unroll16         98.6%     90.4%     93.6%
ScottD           96.5%
32B code align             95.5%
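The percentages are computed roughly like this (a sketch of the formula above, not my actual benchmark harness):
//Sketch of the formula above: percentage of peak, given the measured
//bandwidth in bytes/s, the core frequency in Hz, and num_bytes_cycle
//(24 for SSE, 48 for AVX, 96 for FMA).
double pct_of_peak(double measured_bytes_per_s, double freq_hz, double num_bytes_cycle) {
    return 100.0 * measured_bytes_per_s / (freq_hz * num_bytes_cycle);
}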
For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on IB unrolling actually is worse. On Haswell I get by far the best result by unrolling, which is why I asked this question. The source code to test this can be found in that question.
Edit:
Based on ScottD's answer I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem 64-bit mode). I'm not sure why the cmp matters, actually, since it should take 2 clock cycles per iteration anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.
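If you want to experiment with the code-alignment effect directly from the compiler, GCC's loop-alignment flag is one knob to try (just a suggestion; the numbers above were not measured with it, and the file name is only illustrative):
gcc -c -O3 -mavx -falign-loops=32 triad.c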
How about this? Compiler is gcc 4.9.0 mingw x64:
#include <stdint.h>      //for intptr_t
#include <immintrin.h>   //for the _mm256_* intrinsics

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}
gcc -c -O3 -march=corei7 -mavx2 triad.c
0000000000000000 <triad>:
0: 44 89 c8 mov eax,r9d
3: f7 d8 neg eax
5: 48 98 cdqe
7: 48 85 c0 test rax,rax
a: 79 31 jns 3d <triad+0x3d>
c: c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
14: 4d 63 c9 movsxd r9,r9d
17: 49 c1 e1 02 shl r9,0x2
1b: 4c 01 ca add rdx,r9
1e: 4c 01 c9 add rcx,r9
21: 4d 01 c8 add r8,r9
24: c5 f4 59 04 82 vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
29: c5 fc 58 04 81 vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
2e: c4 c1 7c 29 04 80 vmovaps YMMWORD PTR [r8+rax*4],ymm0
34: 48 83 c0 08 add rax,0x8
38: 78 ea js 24 <triad+0x24>
3a: c5 f8 77 vzeroupper
3d: c3 ret
Like your hand-written code, gcc uses 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward, and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
The instruction decoder on Intel Ivy Bridge or later can fuse the cmp and jne into a single operation in the pipeline (called macro-op fusion), so on these recent processors the cmp should disappear anyway.
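For reference, one way to coax a scale=1 addressing mode out of C is to index by a byte offset through char pointers. This is just a sketch of the idea; I'm not claiming this exact code gives the 5-instruction loop:
#include <stdint.h>
#include <immintrin.h>

//Sketch: byte offsets through char* make the natural addressing mode
//base + offset*1 instead of base + offset*4.
void triad_scale1(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    char *cx = (char *)(x + n);
    char *cy = (char *)(y + n);
    char *cz = (char *)(z + n);
    for(intptr_t i = -(intptr_t)n * (intptr_t)sizeof(float); i < 0; i += 32) {
        __m256 v = _mm256_add_ps(_mm256_load_ps((float *)(cx + i)),
                                 _mm256_mul_ps(k4, _mm256_load_ps((float *)(cy + i))));
        _mm256_store_ps((float *)(cz + i), v);
    }
}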
Final code:
#define SF sizeof(float)
#ifndef NO          //floats per vector, compile with -DNO = 1,2,4,8,...
#define NO 8        //MUST be power of two
#endif

void triadfinaler(float const *restrict x, float const *restrict y,
                  float *restrict z, size_t n)
{
    float *restrict d = __builtin_assume_aligned(z, NO*SF);       //gcc builtin,
    float const *restrict m = __builtin_assume_aligned(y, NO*SF); //optional but produces
    float const *restrict a = __builtin_assume_aligned(x, NO*SF); //better code
    float const k = 3.14159f;
    n *= SF;
    while (n &= ~((size_t)(NO*SF)-1))   //this is why NO*SF must be power of two
    {
        size_t nl = n/SF;
        for (size_t i = 0; i < NO; i++)
        {
            d[nl-NO+i] = k * m[nl-NO+i] + a[nl-NO+i];
        }
        n -= (NO*SF);
    }
}
I prefer to let the compiler choose the instructions, rather than using intrinsics (not least because you used intel-intrinsics, which gcc doesn't really like). Anyway, the following code produces nice assembly for me on gcc 4.8:
void triad(float *restrict x, float *restrict y, float *restrict z, size_t n)
//I hope you weren't aliasing any function arguments... Oh, and it's void, not float
{
    float *restrict d = __builtin_assume_aligned(z, 32);  // Uh, make sure your arrays
    float *restrict m = __builtin_assume_aligned(y, 32);  // are aligned? Faster that way
    float *restrict a = __builtin_assume_aligned(x, 32);  //
    float const k = 3.14159f;
    while (n &= ~((size_t)0x7))       //black magic, causes gcc to omit code for non-multiples of 8 floats
    {
        n -= 8;                       //You were always computing on 8 floats at a time, right?
        d[n+0] = k * m[n+0] + a[n+0]; //manual unrolling
        d[n+1] = k * m[n+1] + a[n+1];
        d[n+2] = k * m[n+2] + a[n+2];
        d[n+3] = k * m[n+3] + a[n+3];
        d[n+4] = k * m[n+4] + a[n+4];
        d[n+5] = k * m[n+5] + a[n+5];
        d[n+6] = k * m[n+6] + a[n+6];
        d[n+7] = k * m[n+7] + a[n+7];
    }
}
This produces nice code for my corei7avx2, with -O3:
triad:
    andq    $-8, %rcx
    je      .L8
    vmovaps .LC0(%rip), %ymm1
.L4:
    subq    $8, %rcx
    vmovaps (%rsi,%rcx,4), %ymm0
    vfmadd213ps (%rdi,%rcx,4), %ymm1, %ymm0
    vmovaps %ymm0, (%rdx,%rcx,4)
    andq    $-8, %rcx
    jne     .L4
    vzeroupper
.L8:
    rep ret
    .cfi_endproc
.LC0:
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
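Of course, the __builtin_assume_aligned() hints above are only legitimate if the caller really does pass 32-byte aligned arrays. A minimal caller sketch, assuming C11's aligned_alloc is available, could look like:
#include <stdlib.h>

//Minimal caller sketch (assumes C11 aligned_alloc): the buffers really
//must be 32-byte aligned, or the assume_aligned hints above are lies.
int main(void) {
    size_t n = 2048;                              //multiple of 8 floats
    float *x = aligned_alloc(32, n * sizeof(float));
    float *y = aligned_alloc(32, n * sizeof(float));
    float *z = aligned_alloc(32, n * sizeof(float));
    if (!x || !y || !z) return 1;
    for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    triad(x, y, z, n);                            //the function defined above
    free(x); free(y); free(z);
    return 0;
}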
Edit:
I was a bit disappointed with the compiler not optimizing this code down to the last instruction, so I messed around with it a bit more. Just changing the order of things in the loop got rid of the AND emitted by the compiler, which got me on the right track. Then I only had to get it to avoid unnecessary address calculations in the loop. Sigh.
void triadtwo(float *restrict x, float *restrict y, float *restrict z, size_t n)
{
    float *restrict d = __builtin_assume_aligned(z, 32);
    float *restrict m = __builtin_assume_aligned(y, 32);
    float *restrict a = __builtin_assume_aligned(x, 32);
    float const k = 3.14159f;
    n <<= 2;
    while (n &= -32)
    {
        d[(n>>2)-8] = k * m[(n>>2)-8] + a[(n>>2)-8];
        d[(n>>2)-7] = k * m[(n>>2)-7] + a[(n>>2)-7];
        d[(n>>2)-6] = k * m[(n>>2)-6] + a[(n>>2)-6];
        d[(n>>2)-5] = k * m[(n>>2)-5] + a[(n>>2)-5];
        d[(n>>2)-4] = k * m[(n>>2)-4] + a[(n>>2)-4];
        d[(n>>2)-3] = k * m[(n>>2)-3] + a[(n>>2)-3];
        d[(n>>2)-2] = k * m[(n>>2)-2] + a[(n>>2)-2];
        d[(n>>2)-1] = k * m[(n>>2)-1] + a[(n>>2)-1];
        n -= 32;
    }
}
Ugly code? Yes. But the assembly:
triadtwo:
    salq    $2, %rcx
    andq    $-32, %rcx
    je      .L54
    vmovaps .LC0(%rip), %ymm1
.L50:
    vmovaps -32(%rsi,%rcx), %ymm0
    vfmadd213ps -32(%rdi,%rcx), %ymm1, %ymm0
    vmovaps %ymm0, -32(%rdx,%rcx)
    subq    $32, %rcx
    jne     .L50
    vzeroupper
.L54:
    rep ret
    .cfi_endproc
.LC0:
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
Mmmmhhh, glorious five instructions in the loop, macro-op fusable subtract-and-branch...