Even though I am compiling for armv7
only, NEON multiply-accumulate intrinsics appear to be being decomposed into separate multiplies and adds.
I've experienced this with several versions of Xcode up to the latest 4.5, with iOS SDKs 5 through 6, and with different optimisation settings, both building through Xcode and through the commandline directly.
For instance, building and disassembling some test.cpp
containing
#include <arm_neon.h>
float32x4_t test( float32x4_t a, float32x4_t b, float32x4_t c )
{
float32x4_t result = a;
result = vmlaq_f32( result, b, c );
return result;
}
with
clang++ -c -O3 -arch armv7 -o "test.o" test.cpp
otool -arch armv7 -tv test.o
results in
test.o:
(__TEXT,__text) section
__Z4test19__simd128_float32_tS_S_:
00000000 f10d0910 add.w r9, sp, #16 @ 0x10
00000004 46ec mov ip, sp
00000006 ecdc2b04 vldmia ip, {d18-d19}
0000000a ecd90b04 vldmia r9, {d16-d17}
0000000e ff420df0 vmul.f32 q8, q9, q8
00000012 ec432b33 vmov d19, r2, r3
00000016 ec410b32 vmov d18, r0, r1
0000001a ef400de2 vadd.f32 q8, q8, q9
0000001e ec510b30 vmov r0, r1, d16
00000022 ec532b31 vmov r2, r3, d17
00000026 4770 bx lr
instead of the expected use of vmla.f32
.
What am I doing wrong, please?
Neon multiply-add instructions perform the operation
Note that the destination and one of the sources is the same. If you want to perform the operation
the compiler will have to decompose it into two instructions
Alternative, it can decompose it into multiply + add instructions
On Cortex-A8/A9 both variants have the same throughput, but on Cortex-A8 the second variant has lower latency, because multiply-add instruction causes stalls in many situations.
It is either a bug or an optimization by llvm-clang. armcc or gcc produces vmla as you expect but if you read Cortex-A Series Programmer’s Guide v3, it says:
So it makes sense for llvm-clang to separate vmla into multiply and accumulate to fill the pipeline.