Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);
int main() {
volatile int va = 1;
volatile int vb = 2;
volatile int vc = 3;
volatile int vd = 4;
int a = va;
int b = vb;
int c = vc;
int d = vd;
int s = 0;
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
for (int i=0; i<2000000000; i++) {
s += test(a, b, c, d);
}
return s;
}
two.cpp:
bool test(int a, int b, int c, int d) {
// return a == d || b == d || c == d;
return false;
}
There are 16 nop
s in one.cpp. You can comment/decomment them to change alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native
.
Here are my questions:
- the 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why is the difference?
- with the 32-aligned version, the code runs the same speed on Sandy Bridge, doesn't matter which return statement is in two.cpp. I thought the
return false
version should be faster at least a little bit. But no, exactly the same speed! - If I remove
volatile
s from one.cpp, code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? But this only happens, when the loop aligned to 32.
The fact that 32-aligned version is faster, is strange to me, because Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16- byte aligned.
Another little question: is there any tricks to make only this loop 32-aligned (so rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for 16/32 aligned, just the address differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]