Using NEON multiply accumulate on iOS

Published 2020-07-25 23:01

Even though I am compiling for armv7 only, NEON multiply-accumulate intrinsics appear to be decomposed into separate multiply and add instructions.

I've experienced this with several versions of Xcode up to the latest 4.5, with iOS SDKs 5 through 6, and with different optimisation settings, both building through Xcode and through the commandline directly.

For instance, building and disassembling some test.cpp containing

#include <arm_neon.h>

float32x4_t test( float32x4_t a, float32x4_t b, float32x4_t c )
{
   float32x4_t result = a;
   result = vmlaq_f32( result, b, c );
   return result;
}

with

clang++ -c -O3 -arch armv7 -o "test.o" test.cpp
otool -arch armv7 -tv test.o

results in

test.o:
(__TEXT,__text) section
__Z4test19__simd128_float32_tS_S_:
00000000    f10d0910    add.w   r9, sp, #16 @ 0x10
00000004        46ec    mov ip, sp
00000006    ecdc2b04    vldmia  ip, {d18-d19}
0000000a    ecd90b04    vldmia  r9, {d16-d17}
0000000e    ff420df0    vmul.f32    q8, q9, q8
00000012    ec432b33    vmov    d19, r2, r3
00000016    ec410b32    vmov    d18, r0, r1
0000001a    ef400de2    vadd.f32    q8, q8, q9
0000001e    ec510b30    vmov    r0, r1, d16
00000022    ec532b31    vmov    r2, r3, d17
00000026        4770    bx  lr

instead of the expected use of vmla.f32.

What am I doing wrong, please?

2 Answers
beautiful° · 2020-07-25 23:42

NEON multiply-accumulate instructions perform the operation

c = c + a * b

Note that the destination and one of the sources are the same register. If you want to perform the operation

d = c + a * b

the compiler will have to decompose it into two instructions

d = c
d = d + a * b

Alternatively, it can decompose it into separate multiply and add instructions

d = a * b
d = d + c

On Cortex-A8/A9 both variants have the same throughput, but on Cortex-A8 the second variant has lower latency, because the multiply-accumulate instruction causes stalls in many situations.
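The two decompositions can be sketched as plain scalar C++ (illustrative only, no NEON; the function names are made up for the sketch). Both compute the same `d = c + a * b`; they differ only in which instructions the compiler would emit:

```cpp
#include <cassert>

// Variant 1: copy the accumulator, then multiply-accumulate in place.
// The in-place update maps onto vmla, but costs an extra register move.
float mla_copy(float a, float b, float c) {
    float d = c;      // d = c
    d += a * b;       // d = d + a*b  (vmla-shaped)
    return d;
}

// Variant 2: multiply into the destination, then add the accumulator.
// This maps onto a vmul followed by a vadd, with no copy needed.
float mul_add(float a, float b, float c) {
    float d = a * b;  // d = a*b      (vmul-shaped)
    d += c;           // d = d + c    (vadd-shaped)
    return d;
}
```

Numerically the two are equivalent here because NEON's vmla is a chained (not fused) multiply-add, so the intermediate product is rounded in both variants.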

三岁会撩人 · 2020-07-25 23:44

It is either a bug or an optimization in llvm-clang. armcc and gcc produce vmla as you expect, but the Cortex-A Series Programmer's Guide (v3) says:

20.2.3 Scheduling

In some cases there can be a considerable latency, particularly VMLA multiply-accumulate (five cycles for an integer; seven cycles for a floating-point). Code using these instructions should be optimized to avoid trying to use the result value before it is ready, otherwise a stall will occur. Despite having a few cycles result latency, these instructions do fully pipeline so several operations can be in flight at once.

So it makes sense for llvm-clang to split vmla into separate multiply and add instructions in order to keep the pipeline filled.
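The scheduling idea in the quoted passage, not using a multiply-accumulate result before it is ready, can be sketched in plain scalar C++ (illustrative only, not NEON code; `dot` is a made-up example function). Keeping two independent accumulator chains in flight means no multiply-accumulate has to wait on the previous one's result:

```cpp
#include <cstddef>

// Illustrative dot product with two independent accumulator chains.
// Each chain's multiply-accumulate depends only on that chain's previous
// result, so the latency of one chain is hidden by work on the other.
float dot(const float* x, const float* y, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i] * y[i];          // chain 0
        acc1 += x[i + 1] * y[i + 1];  // chain 1
    }
    for (; i < n; ++i)                // scalar tail
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```

With a single accumulator, each iteration would stall for the full vmla result latency; splitting the dependency chain trades one extra register for much better pipeline utilisation.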
