Generated code not matching expectations with Exte

I have a CpuFeatures class. The requirements for the class are simple: (1) preserve EBX or RBX, and (2) record the values returned from CPUID in EAX/EBX/ECX/EDX. I'm not sure the code being generated is the code I intended.

The CpuFeatures class code uses GCC Extended ASM. Here's the relevant code:

struct CPUIDinfo
{
    word32 EAX;
    word32 EBX;
    word32 ECX;
    word32 EDX;
};

bool CpuId(word32 func, word32 subfunc, CPUIDinfo& info)
{
    uintptr_t scratch;

    __asm__ __volatile__ (

        ".att_syntax \n"

#if defined(__x86_64__)
        "\t xchgq %%rbx, %q1 \n"
#else
        "\t xchgl %%ebx, %k1 \n"
#endif

        "\t cpuid \n"

#if defined(__x86_64__)
        "\t xchgq %%rbx, %q1 \n"
#else
        "\t xchgl %%ebx, %k1 \n"
#endif

      : "=a"(info.EAX), "=&r"(scratch), "=c"(info.ECX), "=d"(info.EDX)
      : "a"(func), "c"(subfunc)
    );

    if(func == 0)
        return !!info.EAX;

    return true;
}

The code below was compiled with -g3 -Og on Cygwin i386. When I examine it under a debugger, I don't like what I am seeing.

Dump of assembler code for function CpuFeatures::DoDetectX86Features():
   ...
   0x0048f355 <+1>:     sub    $0x48,%esp
=> 0x0048f358 <+4>:     mov    $0x0,%ecx
   0x0048f35d <+9>:     mov    %ecx,%eax
   0x0048f35f <+11>:    xchg   %ebx,%ebx
   0x0048f361 <+13>:    cpuid
   0x0048f363 <+15>:    xchg   %ebx,%ebx
   0x0048f365 <+17>:    mov    %eax,0x10(%esp)
   0x0048f369 <+21>:    mov    %ecx,0x18(%esp)
   0x0048f36d <+25>:    mov    %edx,0x1c(%esp)
   0x0048f371 <+29>:    mov    %ebx,0x14(%esp)
   0x0048f375 <+33>:    test   %eax,%eax
   ...

I don't like what I am seeing because it appears EBX/RBX is not being preserved (xchg %ebx,%ebx at +11). Additionally, it looks like the preserved EBX/RBX is being saved as the result of CPUID, and not the actual value of EBX returned by CPUID (xchg %ebx,%ebx at +15, before the mov %ebx,0x14(%esp) at +29).

If I change the operand to use a memory op with "=&m"(scratch), then the generated code is:

0x0048f35e <+10>:    xchg   %ebx,0x40(%esp)
0x0048f362 <+14>:    cpuid
0x0048f364 <+16>:    xchg   %ebx,0x40(%esp)

A related question is "What ensures reads/writes of operands occurs at desired times with extended ASM?"

What am I doing wrong (besides wasting countless hours on something that should have taken 5 or 15 minutes)?

The code below is a complete example that I used to compile your code above, modified to exchange (swap) %ebx directly with the info.EBX variable.

#include <inttypes.h>
#define word32 uint32_t

struct CPUIDinfo
{
    word32 EAX;
    word32 EBX;
    word32 ECX;
    word32 EDX;
};

bool CpuId(word32 func, word32 subfunc, CPUIDinfo& info)
{
    __asm__ __volatile__ (

        ".att_syntax \n"

#if defined(__x86_64__)
        "\t xchgq %%rbx, %q1 \n"
#else
        "\t xchgl %%ebx, %k1 \n"
#endif

        "\t cpuid \n"

#if defined(__x86_64__)
        "\t xchgq %%rbx, %q1 \n"
#else
        "\t xchgl %%ebx, %k1 \n"
#endif

      : "=a"(info.EAX), "=&m"(info.EBX), "=c"(info.ECX), "=d"(info.EDX)
      : "a"(func), "c"(subfunc)
    );

    if(func == 0)
        return !!info.EAX;

    return true;
}

int main()
{
    CPUIDinfo  cpuInfo;
    CpuId(1, 0, cpuInfo);
}

The first thing to notice is that I use the info.EBX memory location as the target of the swap. This eliminates the need for another temporary variable or register.

I assembled as 32-bit code with -g3 -Og -S -m32 and got these instructions of interest:

xchgl %ebx, 4(%edi)
cpuid
xchgl %ebx, 4(%edi)

movl    %eax, (%edi)
movl    %ecx, 8(%edi)
movl    %edx, 12(%edi)

%edi happens to contain the address of the info structure. 4(%edi) happens to be the address of info.EBX. We swap %ebx and 4(%edi) after cpuid. With that instruction ebx is restored to what it was before cpuid and 4(%edi) now has what ebx was right after cpuid was executed. The remaining movl lines place eax, ecx, edx registers into the rest of the info structure via the %edi register.

The generated code above is what I would expect it to be.

In your code, the scratch variable (with the constraint "=&m"(scratch)) is never used after the assembler template, so 0x40(%esp) holds the value you want but it never gets moved anywhere useful. You'd have to copy the scratch variable into info.EBX (i.e. info.EBX = scratch;) and then look at all of the instructions that get generated. Somewhere among them the data would be copied from the scratch memory location to info.EBX.
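As a rough sketch of that suggestion, the version below keeps the register scratch operand from your question and adds the missing copy into info.EBX. The function name CpuIdViaScratch and the non-x86 fallback that returns false are my own additions for illustration, and I use uint32_t directly instead of word32. Treat it as one possible arrangement, not the definitive fix.

```cpp
#include <cstdint>

struct CPUIDinfo
{
    uint32_t EAX;
    uint32_t EBX;
    uint32_t ECX;
    uint32_t EDX;
};

// Hypothetical name; same idea as CpuId in the question, plus the copy.
bool CpuIdViaScratch(uint32_t func, uint32_t subfunc, CPUIDinfo& info)
{
#if defined(__x86_64__) || defined(__i386__)
    uintptr_t scratch;  // receives whatever CPUID left in EBX/RBX

    __asm__ __volatile__ (
#if defined(__x86_64__)
        "xchgq %%rbx, %q1 \n\t"   // stash caller's RBX in the scratch register
        "cpuid            \n\t"
        "xchgq %%rbx, %q1 \n\t"   // restore RBX; scratch now holds CPUID's EBX
#else
        "xchgl %%ebx, %k1 \n\t"
        "cpuid            \n\t"
        "xchgl %%ebx, %k1 \n\t"
#endif
      : "=a"(info.EAX), "=&r"(scratch), "=c"(info.ECX), "=d"(info.EDX)
      : "a"(func), "c"(subfunc)
    );

    // Without this line the output operand is dead and EBX is never recorded.
    info.EBX = static_cast<uint32_t>(scratch);

    if (func == 0)
        return !!info.EAX;

    return true;
#else
    // Assumption for this sketch: report failure on non-x86 targets.
    (void)func; (void)subfunc; (void)info;
    return false;
#endif
}
```

If the compiler happens to pick %ebx itself for the scratch operand (as in your Cygwin dump), the two xchg instructions become no-ops and the value CPUID wrote to %ebx is still what ends up in info.EBX.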

Update - Cygwin and MinGW

I wasn't entirely satisfied that the Cygwin code output was correct. In the middle of the night I had an Aha! moment. Windows already handles position independence itself: when the dynamic link loader loads an image (DLL etc.), it modifies the image via re-basing. There is no need for additional PIC processing as is done for 32-bit Linux shared libraries, so there is no issue with ebx/rbx. This is why Cygwin and MinGW present warnings like this when compiling with -fPIC:

warning: -fPIC ignored for target (all code is position independent)

This is because under Windows all 32-bit code can be re-based when it is loaded by the Windows dynamic loader. More about re-basing can be found in this Dr. Dobbs article. Information on the Windows Portable Executable format (PE) can be found in this Wiki article. Cygwin and MinGW don't need to worry about preserving ebx/rbx when targeting 32-bit code because on their platforms PIC is already handled by the OS, other re-basing tools, and the linker.
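To illustrate the point, here is a minimal sketch that only does the %ebx swap when building 32-bit position-independent code, and otherwise lets CPUID write %ebx directly through the "b" constraint. It assumes the compiler defines __PIC__ exactly when %ebx may be reserved for PIC, which I have not verified for every toolchain. The name CpuIdPicAware and the non-x86 fallback are hypothetical.

```cpp
#include <cstdint>

struct CPUIDinfo
{
    uint32_t EAX;
    uint32_t EBX;
    uint32_t ECX;
    uint32_t EDX;
};

// Hypothetical helper: swap EBX only where PIC might need it preserved.
bool CpuIdPicAware(uint32_t func, uint32_t subfunc, CPUIDinfo& info)
{
#if defined(__i386__) && defined(__PIC__)
    // 32-bit PIC (e.g. Linux shared libraries): EBX may hold the GOT
    // pointer, so route CPUID's EBX through another register.
    __asm__ __volatile__ (
        "xchgl %%ebx, %k1 \n\t"
        "cpuid            \n\t"
        "xchgl %%ebx, %k1 \n\t"
      : "=a"(info.EAX), "=&r"(info.EBX), "=c"(info.ECX), "=d"(info.EDX)
      : "a"(func), "c"(subfunc)
    );
    return true;
#elif defined(__i386__) || defined(__x86_64__)
    // No PIC register to protect (assumed to cover Cygwin/MinGW 32-bit and
    // x86-64): name EBX as an output and let the compiler handle it.
    __asm__ __volatile__ (
        "cpuid"
      : "=a"(info.EAX), "=b"(info.EBX), "=c"(info.ECX), "=d"(info.EDX)
      : "a"(func), "c"(subfunc)
    );
    return true;
#else
    // Assumption for this sketch: report failure on non-x86 targets.
    (void)func; (void)subfunc; (void)info;
    return false;
#endif
}
```

Declaring %ebx as an output in the second branch tells the compiler the register is overwritten, so it saves and restores any live value itself.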