Why would a compiler generate this assembly?

2020-08-09 09:05发布

问题:

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:

void QMainWindowLayout::invalidate()
{
QLayout::invalidate()
minSize = szHint = QSize();
}

It is compiled to this:

<invalidate()>        push   %rbx
<invalidate()+1>      mov    %rdi,%rbx
<invalidate()+4>      callq  0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9>      movl   $0xffffffff,0x564(%rbx)
<invalidate()+19>     movl   $0xffffffff,0x568(%rbx)
<invalidate()+29>     mov    0x564(%rbx),%rax
<invalidate()+36>     mov    %rax,0x56c(%rbx)
<invalidate()+43>     pop    %rbx
<invalidate()+44>     retq

The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.

So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?

(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)

回答1:

Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize the code size here. There is no 64-bit immediate to memory mov instruction. So the compiler has to generate 2 mov instructions just like it did above. Each of them would be 10 bytes, the 2 moves generated are 14 bytes. It's been written to so there is most likely no memory latency so I do not think you'll take any performance hit here.



回答2:

The code is "less than perfect".

For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:

00000000  31C0              xor eax,eax
00000002  48F7D0            not rax
00000005  48898364050000    mov [rbx+0x564],rax
0000000C  4889836C050000    mov [rbx+0x56c],rax

;Note: XOR above clears RAX due to zero extension

For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:

xor eax,eax
not rax                 ;Must wait until previous instruction finishes
mov [rbx+0x564],rax     ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax     ;Must wait until "not" finishes

For performance you want to do this:

00000000  48C7C0FFFFFFFF        mov rax,0xffffffff
00000007  C78364050000FFFFFFFF  mov dword [rbx+0x564],0xffffffff
00000011  C78368050000FFFFFFFF  mov dword [rbx+0x568],0xffffffff
0000001B  C7836C050000FFFFFFFF  mov dword [rbx+0x56c],0xffffffff
00000025  C78370050000FFFFFFFF  mov dword [rbx+0x570],0xffffffff

;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension

This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (45 bytes).

If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:

mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov dword [rbx+0x56c],rax

This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.

Now; let's look at the original code and see why it is bad:

mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564]                ;Massive problem
mov [rbx+0x56C],rax                ;Depends on previous instruction

Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.



回答3:

I'd break down the lines as this (think several have comment same steps)

These two lines comes from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h which set each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint which is also set at the same time.

<invalidate()+9>      movl   $0xffffffff,0x564(%rbx)
<invalidate()+19>     movl   $0xffffffff,0x568(%rbx)

These lines are finally setting minSize using 64bit operations because the compiler now know the size of a QSize object. And the address of minSize is 0x56c(%rbx)

<invalidate()+29>     mov    0x564(%rbx),%rax
<invalidate()+36>     mov    %rax,0x56c(%rbx)

Note. First part is setting two separate fields, and next part is copying a QSize object (regardless content). The question then is, should the compiler be smart enough to build a compound 64bit value because it saw preset values just earlier? Not sure about that...



回答4:

In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)

Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.

An access to data unaligned on 64-byte boundary leads to two memory accesses and requires several μops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, the cost of each stall generally are greater on machines with longer pipelines.

Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.

Normally Intel processors use store to load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load of several smaller stores does not store-load-forward but stalls: (p 3-66, p 3-68)

Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.

; A. Large load stall
mov     mem, eax        ; Store dword to address “MEM"
mov     mem + 4, ebx    ; Store dword to address “MEM + 4"
fld     mem             ; Load qword at address “MEM", stalls

So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?

EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.