c inline assembly getting “operand size mismatch”

2019-03-01 11:14发布

问题:

I'm trying to use cmpxchg with inline assembly through c. This is my code:

static inline int
cas(volatile void* addr, int expected, int newval) {
    int ret;
    asm volatile("movl %2 , %%eax\n\t"
                "lock; cmpxchg %0, %3\n\t"
                "pushfl\n\t"
                "popl %1\n\t"
                "and $0x0040, %1\n\t"
                : "+m" (*(int*)addr), "=r" (ret)
                : "r" (expected), "r" (newval)
                : "%eax"
                );
    return ret;
}

This is my first time using inline and i'm not sure what could be causing this problem. I tried "cmpxchgl" as well, but still nothing. Also tried removing the lock. I get "operand size mismatch". I think maybe it has something to do with the casting i do to addr, but i'm unsure. I try and exchange int for int, so don't really understand why there would be a size mismatch. This is using AT&T style. Thanks

回答1:

You had the operand order for the cmpxchg instruction is reversed. AT&T syntax needs the memory destination last:

    "lock; cmpxchg %3, %0\n\t"

Or you could compile that instruction with its original order using -masm=intel, but the rest of your code is AT&T syntax and ordering so that's not the right answer.


As far as why it says "operand size mismatch", I can only say that that appears to be an assembler bug, in that it uses the wrong message.



回答2:

As @prl points out, you reversed the operands, putting them in Intel order (See Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition, then you get (on the Godbolt compiler explorer):

    movl %edx , %eax
    lock; cmpxchg %ebx, (%ecx)       # error on this line from the assembler
    pushfl
    popl %edx
    and $0x0040, %edx

Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T syntax last operand).

This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)


You're discarding the load result from the cas-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect, rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)

(The weak/strong thing allows better code-gen for CAS retry-loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or being rewritten with the same value. x86's lock cmpxchg is "strong")

Actually, GCC's legacy __sync builtins provide 2 separate CAS functions: one that returns the old value, and one that returns a bool. Both take the old/new value by reference. So it's not the same API that C++11 uses, but apparently it isn't so horrible that nobody used it.


Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of).

Or even better, use GCC6 flag-return syntax so using this in a loop can compile to a cmpxchg / jne loop instead of cmpxchg / setz %al / test %al,%al / jnz.

We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is mov, you're probably using constraints inefficiently.)

Of course, by far the best thing for real usage would be to use C11 stdatomic or a GCC builtin. https://gcc.gnu.org/wiki/DontUseInlineAsm in cases where the compiler can emit just as good (or better) asm from code it "understands", because inline asm constrains the compiler. It's also difficult to write correctly / efficient, and to maintain.

Portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type width of register width or smaller:

// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
    char ret;
    __asm__ __volatile__ (
            "  lock; cmpxchg  {%[newval], %[mem] | %[mem], %[newval]}\n"
            : "=@ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
            : [newval]"r" (newVal)
            : "memory");    // barrier for compiler reordering around this

    return ret;   // ZF result, 1 on success else 0
}


// spinning read-only is much better (with _mm_pause in the retry loop)
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
    int oldval = 0;
    while(!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}

The { foo,bar | bar,foo } is ASM dialect alternatives. For x86, it's {AT&T | Intel}. The %[newval] is a named operand constraint; it's another way to keep your operands . The "=ccz" takes the z condition code as the output value, like a setz.

Compiles on Godbolt to this asm for 32-bit x86 with AT&T output:

cas_retry:
    pushl   %ebx
    movl    8(%esp), %edx      # load the pointer arg.
    movl    $1, %ecx
    xorl    %ebx, %ebx
.L2:
    movl    %ebx, %eax          # xor %eax,%eax would save a lot of insns
      lock; cmpxchg  %ecx, (%edx) 

    jne     .L2
    popl    %ebx
    ret

gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline-asm, though (from x86 spinlock using cmpxchg):

// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
    while(!__sync_bool_compare_and_swap(p, 0, 1));
}

Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (xor-zeroing elimination but no mov-elimination).