Code alignment in one object file is affecting the performance of a function in another object file

Posted 2019-01-15 20:58

Question:

I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell NASM inserts nop instructions to achieve code alignment.

Here is a function I have been testing on an Ivy Bridge system:

void triad(float *x, float *y, float *z, int n, int repeat) {
    float k = 3.14159f;
    for(int r=0; r<repeat; r++) {
        for(int i=0; i<n; i++) {
            z[i] = x[i] + k*y[i];
        }
    }
}

The assembly I'm using for this is below. If I don't specify the alignment, my performance relative to peak is only about 90%. However, when I align the code before the loop, as well as both inner loops, to 16 bytes, the performance jumps to 96%. So clearly the code alignment makes a difference in this case.

But here is the strangest part. If I align the innermost loop to 32 bytes, it makes no difference to the performance of this function. However, in another version of this function that uses intrinsics, in a separate object file that I link in, the performance jumps from 90% to 95%!

I did an object dump (using objdump -d -M intel) of the version aligned to 16 bytes (the result is posted at the end of this question) and of the version aligned to 32 bytes, and they are identical! It turns out that the innermost loop is aligned to 32 bytes anyway in both object files. But there must be some difference.

I did a hex dump of each object file, and there is exactly one byte that differs: the object file aligned to 16 bytes has 0x10 where the object file aligned to 32 bytes has 0x20. What exactly is going on? Why does code alignment in one object file affect the performance of a function in another object file? And how do I know the optimal value to align my code to?

My only guess is that when the code is relocated by the loader, the 32-byte-aligned object file affects the other object file, the one using intrinsics. You can find the code to test all this at Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%.

The NASM code I am using:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2                  ; rcx = n*4 bytes (sizeof float)
    add             rdi, rcx                ; point x, y and z just past their
    add             rsi, rcx                ; last element so the inner loop can
    add             rdx, rcx                ; run a negative offset up to zero
    vbroadcastss    ymm2, [rel pi]          ; ymm2 = {k, k, k, k, k, k, k, k}
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

Result from objdump -d -M intel test16.o. The disassembly is identical if I change align 16 to align 32 in the assembly above just before .L2. However, the object files still differ by one byte.

test16.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <pi>:
   0:   d0 0f                   ror    BYTE PTR [rdi],1
   2:   49                      rex.WB
   3:   40 90                   rex xchg eax,eax
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   90                      nop
   9:   90                      nop
   a:   90                      nop
   b:   90                      nop
   c:   90                      nop
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop

0000000000000010 <triad_avx_asm_repeat>:
  10:   48 c1 e1 02             shl    rcx,0x2
  14:   48 01 cf                add    rdi,rcx
  17:   48 01 ce                add    rsi,rcx
  1a:   48 01 ca                add    rdx,rcx
  1d:   c4 e2 7d 18 15 da ff    vbroadcastss ymm2,DWORD PTR [rip+0xffffffffffffffda]        # 0 <pi>
  24:   ff ff 
  26:   90                      nop
  27:   90                      nop
  28:   90                      nop
  29:   90                      nop
  2a:   90                      nop
  2b:   90                      nop
  2c:   90                      nop
  2d:   90                      nop
  2e:   90                      nop
  2f:   90                      nop

0000000000000030 <triad_avx_asm_repeat.L1>:
  30:   48 89 c8                mov    rax,rcx
  33:   48 f7 d8                neg    rax
  36:   90                      nop
  37:   90                      nop
  38:   90                      nop
  39:   90                      nop
  3a:   90                      nop
  3b:   90                      nop
  3c:   90                      nop
  3d:   90                      nop
  3e:   90                      nop
  3f:   90                      nop

0000000000000040 <triad_avx_asm_repeat.L2>:
  40:   c5 ec 59 0c 07          vmulps ymm1,ymm2,YMMWORD PTR [rdi+rax*1]
  45:   c5 f4 58 0c 06          vaddps ymm1,ymm1,YMMWORD PTR [rsi+rax*1]
  4a:   c5 fc 29 0c 02          vmovaps YMMWORD PTR [rdx+rax*1],ymm1
  4f:   48 83 c0 20             add    rax,0x20
  53:   75 eb                   jne    40 <triad_avx_asm_repeat.L2>
  55:   41 83 e8 01             sub    r8d,0x1
  59:   75 d5                   jne    30 <triad_avx_asm_repeat.L1>
  5b:   c5 f8 77                vzeroupper 
  5e:   c3                      ret    
  5f:   90                      nop

Answer 1:

The confusing nature of the effect you are seeing (the assembled code doesn't change!) is due to section alignment. The ALIGN macro in NASM actually has two separate effects:

  1. Add 0 or more nop instructions so that the next instruction is aligned to the specified power-of-two boundary.

  2. Issue an implicit SECTALIGN macro call, which sets the section alignment directive to the alignment amount¹.

The first point is the commonly understood behavior of align: it aligns the loop relative to the start of the section in the output file.

The second part is also needed, however: imagine your loop was aligned to a 32-byte boundary within the assembled section, but the runtime loader then placed your section in memory at an address aligned only to 8 bytes. That would make the in-file alignment quite pointless. To fix this, most executable formats allow each section to specify an alignment requirement, and the runtime loader/linker will load the section at a memory address that respects that requirement.

That's what the hidden SECTALIGN macro does - it ensures that your ALIGN macro works.
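
As a rough sketch of the equivalence (assuming a recent NASM, where ALIGN issues the implicit SECTALIGN), writing the two effects out explicitly would look like this in a code section:

sectalign 32        ; effect 2: raise this section's alignment requirement to 32 bytes
align     32        ; effect 1: pad with nop instructions up to the next 32-byte boundary

If you only want the in-section padding without changing the section alignment, the implicit call can be suppressed with sectalign off before the ALIGN.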

For your file, there is no difference in the assembled code between ALIGN 16 and ALIGN 32 because the next 16-byte boundary happens to also be the next 32-byte boundary (of course, every other 16-byte boundary is a 32-byte one, so that happens about half the time). The implicit SECTALIGN call is still different though, and that's the one byte difference you see in your hexdump. The 0x20 is decimal 32, and the 0x10 is decimal 16.

You can verify this with objdump -h <binary>. Here's an example on a binary I aligned to 32 bytes:

objdump -h loop-test.o

loop-test.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000d18a  0000000000000000  0000000000000000  00000180  2**5
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

The 2**5 in the Algn column is the 32-byte alignment. With 16-byte alignment this changes to 2**4.

Now it should be clear what happens: aligning the first function in your example changes the section alignment, but not the assembly. When you link your program together, the linker merges the various .text sections and picks the highest alignment.

At runtime, this causes the code to be aligned to a 32-byte boundary - but that doesn't affect the first function, because it isn't alignment-sensitive. Since the linker has merged your object files into one section, the larger alignment of 32 changes the alignment of every function (and instruction) in the section, including your other method - and so it changes the performance of your other function, which is alignment-sensitive.
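
As a hypothetical two-file sketch of that cross-object effect (file names and the function name are made up):

; file_a.asm - the hand-written assembly loop
section .text
align 32            ; implicitly raises this .text section's alignment to 32

; file_b.asm - the separately compiled intrinsics version
section .text
other_function:     ; after linking it lives in the merged .text section, whose
                    ; alignment is the maximum over all inputs (here 32), so where
                    ; its instructions fall relative to 32-byte boundaries can change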


¹ To be precise, SECTALIGN only changes the section alignment if the current section alignment is less than the specified amount, so the final section alignment is the same as the largest SECTALIGN directive in the section.
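
A short sketch of that rule:

section .text
align 32        ; section alignment becomes 2**5
; ... more code ...
align 16        ; pads to a 16-byte boundary, but 16 < 32,
                ; so the section alignment stays at 32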



Answer 2:

Ahhh, code alignment...

Some basics of code alignment:

  • Most Intel architectures fetch 16 bytes' worth of instructions per clock.
  • The branch predictor has a larger window and typically looks at about double that per clock. The idea is to stay ahead of instruction fetch.
  • How your code is aligned dictates which instructions you have available to decode and predict at any given clock (a simple code-locality argument).
  • Most modern Intel architectures cache instructions at various levels (either at the macro-instruction level before decoding, or at the micro-op level after decoding). This eliminates the effects of code alignment, as long as you're executing out of the micro/macro cache.
  • Also, most modern Intel architectures have some form of loop stream detector that detects loops and, again, executes them out of a cache that bypasses the front-end fetch mechanism.
  • Some Intel architectures are finicky about what they can cache and what they can't. There are often dependencies on the number of instructions/uops/alignment/branches/etc. Alignment may, in some cases, affect what gets cached and what doesn't, and you can create cases where padding prevents a loop from being cached, or causes it to be.
  • To make things even more complicated, instruction addresses are also used by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value for maintaining some form of global state of branch behavior for prediction purposes, and (3) as a key for determining indirect branch targets. Therefore, alignment can have a pretty huge impact on branch prediction in some cases, due to aliasing or otherwise poor prediction.
  • Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that, if just the right conditions exist.
  • Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop).

Having said all that blah blah, your issue could be any one of these. It's important to look at the disassembly not just of the object file, but of the executable. You want to see what the final addresses are after everything is linked. Making changes in one object can affect the alignment/addresses of instructions in another object after linking.

In some cases, it's nearly impossible to align your code in a way that maximizes performance, simply because so many low-level architectural behaviors are hard to control and predict (that doesn't necessarily mean this is always the case). In such cases, your best bet is to adopt some default alignment strategy (say, align all function entries on 16-byte boundaries, and outer loops likewise) so that you minimize how much your performance varies from change to change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.
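
A minimal sketch of such a default strategy (a hypothetical function, just to show where the padding nops land):

section .text
align 16                ; align the function entry
count_down:
    mov     ecx, 100
align 16                ; the padding nops execute only once, before the loop is entered
.loop:
    dec     ecx
    jnz     .loop       ; the hot loop itself contains no nops
    ret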

Beyond that, I'd need more info/data to pinpoint your exact problem, but I thought some of this might help. Good luck :)