How to prevent LDM/STM instructions expansion in A

I'm trying to generate ~~AXI~~ bus burst accesses using STM/LDM instructions in inline assembly in .c file compiled with ARM Compiler 5 armcc.

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

But ARM Compiler armcc User Guide, paragraph 7.18 is saying: "All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."

And that is what really happens in practice, LDM/STM are expanded into a set of LDR/STR in some cases and order of these instuctions is arbitrary. This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).

To resolve this it's possible to use embedded assembler instead of inline assembler, but this leads to extra function calls-returns what affects performance.

So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.

Target CPU: Cortex M0+ (ARMv6-M).

Edit: Slave devices are all on-chip devices, most of them are non-memory devices. For every register of non-memory slave that supports burst access region of address space is reserved (for example [0x10000..0x10100]), I'm not completely sure why, maybe CPU or bus doesn't support fixed (non-incremental) addresses. HW ignores offsets within this region. Full request can be 16 bytes for example and first word of the full request is first word written (even if offset is non-zero).

So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.

A little bit about compiler optimizations. Register allocation is one of it's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers are using Single static assignment or SSA to rename your 'C' variables into a bunch of pseudo variable (or time order variables).

In order for your STMIA and LDMIA to work you need the loads and stores to be consistent. Ie, if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8} with the 'r3' mapping to the new 'r4' and the stored 'r7' mapping to the restored 'r8'. This is not simple for any compiler to implement generically as 'C' variables will be assigned according to need. Different versions of the same variable maybe in different registers. To make the stm/ldm work those variable must be assigned so that register increments in the right order. Ie, for the ldmia above if the compiler want the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.

You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.

See: ldm/stm and gcc for issues with GCC stm/ldm.

Taking your example,

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

The value of inline is that the whole function body may be put right in the code. The caller might have the w0 and w1 in registers R8 and R4. If the function is not inline, then the compile must place them in R1 and R2 but may have generated extra moves. It is difficult for any compiler to fulfil the requirements of the ldm/stm generically.

This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).

If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.

However, it sounds like the device might be memory? In this case, you have a problem. Normally, memory devices like this will use a cache only? If your CPU has an MPU (memory protection unit) and can enable both data and code cache, then you might resolve this issue. Cache lines will always be burst accesses. Care only needs to be taken in the code to setup the MPU and the data cache. OPs Cortex-M0+ has no cache and the devices are non-memory so this will not be possible (nor needed).

If your device is memory and you have no data cache then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit; loosing the benefits of the random access of the memory device.