RGBA to ABGR: Inline arm neon asm for iOS/Xcode

Published 2020-02-06 10:25

Question:

This code (well, very similar code; I haven't tried exactly this code) compiles using the Android NDK, but not with Xcode for armv7+arm64 iOS.

Errors in comments:

uint32_t *src;
uint32_t *dst;

#ifdef __ARM_NEON
__asm__ volatile(
    "vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected
    "vrev32.8 q0, q0            \n" // error: Unrecognized instruction mnemonic
    "vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected
    :
    : [src]"r"(src), [dst]"r"(dst)
    : "d0", "d1"
    );
#endif

What's wrong with this code?

EDIT1:

I rewrote the code using intrinsics:

uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src));
uint8x16_t y = vrev32q_u8(x);
vst1q_u32(dst, vreinterpretq_u32_u8(y));

After disassembling, I get the following, which is a variation I have already tried:

vld1.32 {d16, d17}, [r0]!
vrev32.8    q8, q8
vst1.32 {d16, d17}, [r1]!

So my code looks like this now, but gives the exact same errors:

__asm__ volatile("vld1.32 {d0, d1}, [%0]! \n"
                 "vrev32.8 q0, q0         \n"
                 "vst1.32 {d0, d1}, [%1]! \n"
                 :
                 : "r"(src), "r"(dst)
                 : "d0", "d1"
                 );

EDIT2:

Reading through the disassembly, I actually found a second version of the function. It turns out that arm64 uses a slightly different instruction set. For example, the arm64 assembly uses rev32.16b v0, v0 instead. The whole function listing (which I can't make heads or tails of) is below:

_My_Function:
cmp     w2, #0
add w9, w2, #3
csel    w8, w9, w2, lt
cmp     w9, #7
b.lo    0x3f4
asr w9, w8, #2
ldr     x8, [x0]
mov  w9, w9
lsl x9, x9, #2
ldr q0, [x8], #16
rev32.16b   v0, v0
str q0, [x1], #16
sub x9, x9, #16
cbnz    x9, 0x3e0
ret

Answer 1:

I have successfully published several iOS apps which make use of ARM assembly language and inline code is the most frustrating way to do it. Apple still requires apps to support both ARM32 and ARM64 devices. Since the code will be built as both ARM32 and ARM64 by default (unless you changed the compile options), you need to design code which will successfully compile in both modes. As you noticed, ARM64 is a completely different mnemonic format and register model. There are 2 simple ways around this:

1) Write your code using NEON intrinsics. ARM specified that the original ARM32 intrinsics would remain mostly unchanged for ARMv8 targets and therefore can be compiled to both ARM32 and ARM64 code. This is the safest/easiest option.

2) Write inline code or a separate '.S' module for your assembly language code. To deal with the 2 compile modes, use "#ifdef __arm64__" and "#ifdef __arm__" to distinguish between the two instruction sets.
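As an illustration of option 1, here is a sketch of the byte reverse written with portable intrinsics. The function name bswap32_buf and the scalar fallback/tail loop are additions for this example (not part of the original answer); the NEON path compiles unchanged as both ARM32 and ARM64:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Reverse the byte order of each 32-bit word in src, writing to dst.
// len is the number of uint32_t elements.
void bswap32_buf(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 16 bytes (four words) per iteration; same intrinsics work on ARM32 and ARM64
    for (; i + 4 <= len; i += 4) {
        uint8x16_t v = vreinterpretq_u8_u32(vld1q_u32(src + i));
        vst1q_u32(dst + i, vreinterpretq_u32_u8(vrev32q_u8(v)));
    }
#endif
    for (; i < len; i++)
        dst[i] = __builtin_bswap32(src[i]);  // scalar tail / non-NEON fallback
}
```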



Answer 2:

Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.

There are many reasons not to use inline-assembly: https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.

In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)

ARM's official documentation:

Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics.


If you do use inline asm anyway, avoid future pain by getting it right.

It's easy to write inline asm that happens to work, but isn't safe wrt. future source changes (and sometimes to future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program. (or vice versa).

A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stackoverflow answer that can be copied as an example into other contexts. @bitwise's code in the question and self-answer both have bugs like this.

The inline asm in the question isn't safe, because it modifies memory without telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile (and the "memory" clobber it's missing) so the compiler can optimize better (but still with significant limitations compared to intrinsics).

volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.


The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile).

I used memory operands so the compiler knows that memory is an input and an output, and giving the compiler the option to use auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.

This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.

void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
    typedef struct { uint64_t low, high; } vec128_t;
    const vec128_t *src = (const vec128_t*) src32;
    vec128_t *dst = (vec128_t*) dst32;

    // with old gcc, this gets gcc to use a pointer compare as the loop condition
    // instead of incrementing a loop counter
    const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
    // len is in units of 4-byte chunks

    while (src < src_endp) {

        #if defined(__ARM_NEON__) || defined(__ARM_NEON)
          #if __LP64__   // FIXME: doesn't account for ILP32 in 64-bit mode
        // aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
        asm ("ldr        q0, %[src] \n\t"
             "rev32.16b  v0, v0 \n\t"
             "str        q0, %[dst]  \n\t"
                     : [dst] "=<>m"(*dst)  // auto-increment/decrement or "normal" memory operand
                     : [src] "<>m" (*src)
                     : "q0", "v0"
                     );
          #else
        // arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
        asm ("vld1.32   {d0, d1}, %[src] \n\t"
             "vrev32.8   q0, q0          \n\t"  // reverse 8 bit elements inside 32bit words
             "vst1.32   {d0, d1}, %[dst] \n"
                     : [dst] "=<>m"(*dst)
                     : [src] "<>m"(*src)
                     : "d0", "d1"
                     );
          #endif
        #else
         #error "no NEON"
        #endif

      // increment pointers by 16 bytes
        src++;   // The inline asm doesn't modify the pointers.
        dst++;   // of course, these increments may compile to a post-increment addressing mode
                 // this way has the advantage of letting the compiler unroll or whatever

     }
}

This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.

Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops is:

.L4:
    vld1.32   {d0, d1}, [r1], #16   @ MEM[(const struct vec128_t *)src32_17]
    vrev32.8   q0, q0          
    vst1.32   {d0, d1}, [r0], #16   @ MEM[(struct vec128_t *)dst32_18]

    cmp     r3, r1    @ src_endp, src32
    bhi     .L4       @,

The register constraint bug

The code in the OP's self-answer has another bug: the input pointer operands use separate "r" constraints. This leads to breakage if the compiler wants to keep the old value around and chooses an input register for src that isn't the same as the output register.

If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.

You will also need a "memory" clobber or dummy memory input/output operands (i.e. that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).

See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using r constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.


Test-case that generates wrong code with @bitwise's asm statement:

// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
  uint32_t *orig_dst = dst;

  uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
  #if __LP64__
asm volatile("ldr q0, [%0], #16   # unused src input was %2\n\t"
             "rev32.16b v0, v0   \n\t"
             "str q0, [%1], #16   # unused dst input was %3\n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"  // ,"memory"
               // clobbers don't include v0?
            );
  #else
asm volatile("vld1.32 {d0, d1}, [%0]!  # unused src input was %2\n\t"
             "vrev32.8 q0, q0         \n\t"
             "vst1.32 {d0, d1}, [%1]!  # unused dst input was %3\n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1" // ,"memory"
             );
  #endif
#else
    #error "No NEON/AdvSIMD"
#endif

  uint32_t final_dst0val = orig_dst[0];
  // gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
  // and uses it for final and initial
  // uncomment the memory clobber, or use a dummy output operand, to avoid this.
  // pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)


  // using orig_dst after the inline asm makes the compiler choose different registers for the
  // "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
  // advertise this non-destructive capability.
  return orig_dst + final_dst0val + initial_dst0val;
}

This compiles to (AArch64 gcc4.8 -O3):

    ldr q0, [x1], #16   # unused src input was x1   // src, src
    rev32.16b v0, v0   
    str q0, [x2], #16   # unused dst input was x0   // dst, dst

    ldr     w1, [x0]  // D.2576, *dst_1(D)
    add     x0, x0, x1, lsl 3 //, dst, D.2576,
    ret

The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3 which is used only in a comment) picked x0.

final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads after and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away).

We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.

With those changes, we get

safe_register_pointer_asm:
    ldr     w3, [x0]  //, *dst_1(D)
    mov     x2, x0    // dst, dst    ### These 2 insns are new

    ldr q0, [x1], #16       // src
    rev32.16b v0, v0   
    str q0, [x2], #16       // dst

    ldr     w1, [x0]  // D.2597, *dst_1(D)
    add     x3, x1, x3, uxtw  // D.2597, D.2597, initial_dst0val   ## And this is new, to add the before and after loads
    add     x0, x0, x3, lsl 2 //, dst, D.2597,
    ret
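For reference, the corrected C source for the register-pointer version might look like the sketch below. Assumptions flagged here and in the comments: GNU assembler syntax for the AArch64 mnemonic (rev32 v0.16b, v0.16b rather than Apple's rev32.16b), and the non-ARM64 branch is a portable stand-in added for this example so it runs anywhere, not part of the original answer:

```c
#include <stddef.h>
#include <stdint.h>

// Byte-reverse four 32-bit words, advancing both pointers by 16 bytes.
// "+r" makes the pointers read-write operands (the asm post-increments them),
// and the "memory" clobber tells the compiler the asm touches memory it can't see.
void reverse16(const uint32_t **srcp, uint32_t **dstp)
{
#if defined(__aarch64__)
    asm ("ldr    q0, [%0], #16   \n\t"
         "rev32  v0.16b, v0.16b  \n\t"
         "str    q0, [%1], #16   \n"
         : "+r"(*srcp), "+r"(*dstp)
         :
         : "v0", "memory");
#else
    // Portable stand-in (an assumption for this example, not part of the answer)
    for (int i = 0; i < 4; i++)
        (*dstp)[i] = __builtin_bswap32((*srcp)[i]);
    *srcp += 4;
    *dstp += 4;
#endif
}
```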


Answer 3:

As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.

#ifdef __ARM_NEON
  #if __LP64__
asm volatile("ldr q0, [%0], #16  \n"
             "rev32.16b v0, v0   \n"
             "str q0, [%1], #16  \n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"
             );
  #else
asm volatile("vld1.32 {d0, d1}, [%0]! \n"
             "vrev32.8 q0, q0         \n"
             "vst1.32 {d0, d1}, [%1]! \n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"
             );
  #endif
#else

The intrinsics code that I posted in the original post generated surprisingly good assembly though, and it also generated the arm64 version for me, so it may be a better idea to use intrinsics in the future.