I want to ensure that gcc knows:
- The pointers refer to non-overlapping chunks of memory
- The pointers are 32-byte aligned
Is the following correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I'm trying to use one read and many writes to saturate the CPU's store ports, in the hope that this makes the performance gain from aligned moves more noticeable.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
void square(const float* __restrict__ __attribute__((aligned(32))) x,
            const int size,
            float* __restrict__ __attribute__((aligned(32))) out0,
            float* __restrict__ __attribute__((aligned(32))) out1,
            float* __restrict__ __attribute__((aligned(32))) out2,
            float* __restrict__ __attribute__((aligned(32))) out3,
            float* __restrict__ __attribute__((aligned(32))) out4) {
    for (int i = 0; i < size; ++i) {
        out0[i] = x[i];
        out1[i] = x[i] * x[i];
        out2[i] = x[i] * x[i] * x[i];
        out3[i] = x[i] * x[i] * x[i] * x[i];
        out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
    }
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3". It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt). Still unaligned moves.
No, using

float *__attribute__((aligned(32))) x

means that the pointer itself is stored in aligned memory, not that it points to aligned memory.¹

There is a way to do this, but it only helps for gcc, not clang or ICC.

See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned, which works on all GNU C compatible compilers, and How can I apply __attribute__((aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC. I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.

(gcc, clang, and ICC output on the Godbolt compiler explorer.)
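As a concrete sketch (mine, not from the original post, and trimmed to two outputs for brevity), this is roughly how the square function could convey alignment with __builtin_assume_aligned on a GNU-C-compatible compiler:

void square(const float* __restrict x, const int size,
            float* __restrict out0, float* __restrict out1) {
    // __builtin_assume_aligned returns void*, so cast back to the element type.
    // The compiler is then free to use aligned accesses and skip alignment prologues.
    const float* xa = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    float* o0 = static_cast<float*>(__builtin_assume_aligned(out0, 32));
    float* o1 = static_cast<float*>(__builtin_assume_aligned(out1, 32));
    for (int i = 0; i < size; ++i) {
        o0[i] = xa[i];
        o1[i] = xa[i] * xa[i];
    }
}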
GCC and clang will use movaps / vmovaps instead of the ...ups forms any time they have a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core 2 / K10 or older.) And as you noticed, gcc is applying the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), another clue that your syntax didn't work.

vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.

The real benefit (these days) of telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or, without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.

A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once, or out0[i] *= x[i]. Knowing alignment enables movaps / mulps xmm0, [rsi]; otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
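For example, a hedged sketch of that kind of test case (the function name and the 16-byte alignment choice are mine, assuming a non-AVX SSE build with a GNU-C-compatible compiler):

void prod(const float* __restrict x, const float* __restrict y,
          float* __restrict out0, int size) {
    // With a known 16-byte alignment, an SSE build can use movaps and fold one
    // load into the multiply (mulps xmm0, [mem]); without it, expect 2x movups.
    const float* xa = static_cast<const float*>(__builtin_assume_aligned(x, 16));
    const float* ya = static_cast<const float*>(__builtin_assume_aligned(y, 16));
    for (int i = 0; i < size; ++i)
        out0[i] = xa[i] * ya[i];
}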
It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's just cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment (e.g. just casting a pointer to that struct back to float * doesn't help).
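A rough illustration of that struct hack (my own sketch; the key point is that every access goes through aligned_floats objects):

struct aligned_floats { alignas(32) float f[8]; };

// Because each 8-float chunk is accessed through an aligned_floats object,
// the compiler can assume 32-byte alignment for the loads and stores.
void square_chunks(const aligned_floats* __restrict x,
                   aligned_floats* __restrict out, int nchunks) {
    for (int c = 0; c < nchunks; ++c)
        for (int i = 0; i < 8; ++i)
            out[c].f[i] = x[c].f[i] * x[c].f[i];
}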
Using more than 4 output streams can hurt by causing more conflict misses in the cache. Skylake's L2 cache is only 4-way associative, for example. But L1d is 8-way, so you're probably OK for small buffers.
If you want to saturate store-port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends on what you want to test.
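For instance, a sketch I'm adding (not from the answer) of a store-heavy loop; compile with something like -O2 -fno-tree-vectorize if you want the stores to stay scalar:

// One scalar store per element is one store uop per 4 bytes, versus one uop
// per 32 bytes for a YMM store, so the store port saturates much sooner.
void fill_scalar(float v, float* __restrict out, int size) {
    for (int i = 0; i < size; ++i)
        out[i] = v;
}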
Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i] + b[i] or a STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (with 256-bit-wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.

But AMD CPUs can do up to 2 memory ops per clock, with at most one being a store, so they'd max out with a copy-and-operate pattern where stores = loads.
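A minimal sketch of that 2-load / 1-store pattern (names are illustrative):

// c[i] = a[i] + b[i]: two loads and one store per element, close to the
// load/store port balance of Sandybridge-family CPUs.
void add_arrays(const float* __restrict a, const float* __restrict b,
                float* __restrict c, int size) {
    for (int i = 0; i < size; ++i)
        c[i] = a[i] + b[i];
}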
BTW, Intel recently announced Sunny Cove (the core microarchitecture of Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to avoid bottlenecking on 1-per-clock loop branches.
Footnote 1: That's why (if you compile without AVX) you get a warning, and gcc omits an and rsp,-32 because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects with extra alignment.)