In my program I need to apply __attribute__((aligned(32))) to an int * or float *.
I tried it like this, but I'm not sure it will work:
int *rarray __attribute__((aligned(32)));
I saw this but didn't find the answer.
So you want to tell the compiler that your pointers are aligned? e.g. that all callers of this function will pass pointers that are guaranteed to be aligned: either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign. (If those aren't available, _mm_malloc is one option, but free isn't guaranteed to be safe on _mm_malloc results: you need _mm_free.) This allows the compiler to auto-vectorize without making a bunch of bloated code to handle unaligned inputs.
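As a minimal sketch of the caller side, assuming a C11 library that provides aligned_alloc: the helper names alloc_ints_aligned32, is_aligned32 and aligned_alloc_demo are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper (not from the original answer): allocate n ints
// on a 32-byte boundary using C11 aligned_alloc.
int *alloc_ints_aligned32(size_t n) {
    size_t bytes = n * sizeof(int);
    // Some versions of the standard require the size to be a multiple of
    // the alignment, so round the byte count up to a multiple of 32.
    size_t rounded = (bytes + 31) & ~(size_t)31;
    return aligned_alloc(32, rounded);
}

// Hypothetical helper: returns 1 if p is 32-byte aligned, else 0.
int is_aligned32(const void *p) {
    return ((uintptr_t)p & 31) == 0;
}

// Allocate, check the alignment, and release with plain free(),
// which is valid for aligned_alloc results (unlike _mm_malloc results).
int aligned_alloc_demo(void) {
    int *p = alloc_ints_aligned32(1000);
    if (p == NULL) return 0;
    int ok = is_aligned32(p);
    free(p);
    return ok;
}
```

Note that the allocation only guarantees the alignment at run time. The compiler still doesn't know about it inside a function that just receives an int *.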
When you manually vectorize with intrinsics, you use _mm256_load_si256 (aligned) or _mm256_loadu_si256 (unaligned) to inform the compiler whether memory is or isn't aligned. Communicating alignment information is the main point of load/store intrinsics, as opposed to simply dereferencing __m256i pointers.
I don't think there's a portable way to inform the compiler that a pointer points to aligned memory. (C11 / C++11 alignas doesn't seem to be able to do that, see below.)
With GNU C __attribute__ syntax, it seems to be necessary to use a typedef to get the attribute to apply to the pointed-to type, rather than to the pointer itself. It's definitely easier to type and easier to read if you declare an aligned_int type or something.
// Only helps GCC, not clang or ICC
typedef __attribute__(( aligned(32))) int aligned_int;
int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
    int sum = 0;
    for (int i=0 ; i<1024 ; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}
This auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt):
pxor xmm0, xmm0
xor eax, eax
.L2:
psubd xmm0, XMMWORD PTR [rsi+rax]
paddd xmm0, XMMWORD PTR [rdi+rax]
add rax, 16
cmp rax, 4096
jne .L2 # end of vector loop
... # horizontal sum with psrldq omitted, see the godbolt link if you're curious
movd eax, xmm0
ret
Without the aligned attribute, you get a big block of scalar intro/outro code, which would be even worse with -march=haswell to make AVX2 code with a wider inner loop.
Clang's normal strategy for unaligned inputs is to use unaligned loads/stores, instead of fully-unrolled intro/outro loops. Without AVX, this means the loads couldn't be folded into memory operands for SSE ALU operations.
The aligned attribute doesn't help clang (tested as recently as clang 7.0): it still uses separate movdqu loads. Note that clang's loop is bigger because it defaults to unrolling by 4, whereas gcc doesn't unroll at all without -funroll-loops (which is enabled by -fprofile-use).
But note, this aligned_int typedef only works for GCC itself, not clang or ICC. gcc memory alignment pragma has another example.
__builtin_assume_aligned is noisier syntax, but does work across all compilers that support GNU C extensions. See How to tell GCC that a pointer argument is always double-word-aligned?
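A minimal sketch of how the earlier loop might look with __builtin_assume_aligned instead of the typedef. The names my_func_assume and assume_aligned_demo are hypothetical, and this assumes a compiler with GNU C extensions. The builtin returns a pointer carrying the alignment promise, so the loop has to use the returned pointers rather than the original arguments.

```c
#include <assert.h>
#include <stdalign.h>

// Hypothetical variant of my_func: callers promise that a and b
// are 32-byte aligned. Breaking that promise is undefined behaviour.
int my_func_assume(const int *restrict a, const int *restrict b) {
    // The alignment assumption is attached to the returned pointers.
    const int *restrict pa = __builtin_assume_aligned(a, 32);
    const int *restrict pb = __builtin_assume_aligned(b, 32);
    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += pa[i] - pb[i];
    }
    return sum;
}

// Hypothetical driver: builds two 32-byte-aligned arrays whose
// elements differ by 2 and returns the sum of the differences.
int assume_aligned_demo(void) {
    static alignas(32) int x[1024];
    static alignas(32) int y[1024];
    for (int i = 0; i < 1024; i++) {
        x[i] = i + 2;
        y[i] = i;
    }
    return my_func_assume(x, y);
}
```

Whether a given compiler actually drops the unaligned-handling code here is something to check in its asm output. The sketch only shows the syntax.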
Note that you can't make an array of aligned_int. (See comments for discussion of sizeof(aligned_int), and the fact that it's still 4, not 32.) GNU C refuses to treat it as an int-with-padding, so with gcc 5.3:
static aligned_int arr[1024];
// error: alignment of array elements is greater than element size
int tmp = sizeof(arr);
clang-3.8 compiles that, and initializes tmp to 4096. Presumably because it's just totally ignoring the aligned attribute in that context, not doing whatever magic gcc does to have a type that's narrower than its required alignment. (So only every fourth element actually has that alignment.)
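The size/alignment mismatch behind that error can be checked directly. This is a small sketch assuming GNU C attribute syntax and a C11 compiler for _Alignof; the function names aligned_int_size and aligned_int_align are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef __attribute__((aligned(32))) int aligned_int;

// The size of the type is still that of a plain int...
size_t aligned_int_size(void) {
    return sizeof(aligned_int);
}

// ...but its required alignment is 32, which is why gcc rejects
// arrays of it: consecutive elements can't all be 32-byte aligned.
size_t aligned_int_align(void) {
    return _Alignof(aligned_int);
}
```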
The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use-cases. However, as @user3528438 pointed out in comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.
To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, rather than to every element.
In portable C11 and C++11, you can use things like alignas(32) int myarray[1024];. See also Struggling with alignas syntax: it seems to only be useful for aligning things themselves, not declaring that pointers point to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: forcibly aligning a pointer rather than telling the compiler it was already aligned.
// declaring aligned storage for arrays
#ifndef __cplusplus
#include <stdalign.h> // for C11: defines alignas() using _Alignas()
#endif // C++11 defines alignas without any headers
// works for global/static or local (aka automatic storage)
alignas(32) int foo[1000]; // portable ISO C++11 and ISO C11 syntax
// __attribute__((aligned(32))) int foo[1000]; // older GNU C
// __declspec something // older MSVC
See the C11 alignas() documentation on cppreference.
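To illustrate what "forcibly aligning a pointer" means, as opposed to declaring storage aligned, here is a rough C sketch. The helper names align_up and align_up_demo are hypothetical. Unlike the & ~63 expression mentioned earlier, which rounds down, this rounds up so the result stays inside the buffer. It assumes the usual flat-address behaviour of uintptr_t conversions, which is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: round p up to the next multiple of align.
// align must be a power of 2.
void *align_up(void *p, uintptr_t align) {
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + align - 1) & ~(align - 1);
    return (void *)addr;
}

// Start from a deliberately offset address inside an over-sized
// buffer and find the first 64-byte boundary at or after it.
// Returns 1 if the result is aligned and still inside the buffer.
int align_up_demo(void) {
    unsigned char raw[128];
    unsigned char *p = align_up(raw + 1, 64);
    return ((uintptr_t)p % 64 == 0) && p >= raw + 1 && p < raw + 128;
}
```

This gives you an aligned address at run time, but it tells the compiler nothing it can use when vectorizing.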
CPP macros can be useful to choose between GNU C __attribute__ syntax and MSVC __declspec syntax for alignment if you want portability on older compilers that don't support C11.
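One possible shape for such a macro, as a sketch only: the macro name ALIGNED, the array macro_aligned_buf and the function macro_buf_is_aligned are made up for illustration, and the MSVC branch assumes the __declspec(align(n)) spelling.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical portability macro: pick whichever alignment
// syntax the current compiler understands.
#if defined(_MSC_VER)
#  define ALIGNED(n) __declspec(align(n))
#elif defined(__GNUC__)
#  define ALIGNED(n) __attribute__((aligned(n)))
#else
#  error "no known alignment syntax for this compiler"
#endif

// The attribute applies to the whole array, not to each element.
ALIGNED(32) int macro_aligned_buf[1000];

// Returns 1 if the array really starts on a 32-byte boundary.
int macro_buf_is_aligned(void) {
    return ((uintptr_t)macro_aligned_buf % 32) == 0;
}
```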
For example, with this code that declares a local array with more alignment than can be assumed for the stack pointer, the compiler has to make space and then AND the stack pointer to get an aligned pointer:
void foo(int *p);
void bar(void) {
    __attribute__((aligned(32))) int a[1000];
    foo (a);
}
compiles to (clang-3.8 -O3 -std=gnu11 for x86-64):
push rbp
mov rbp, rsp # stack frame with base pointer since we're doing unpredictable things to rsp
and rsp, -32 # 32B-align the stack
sub rsp, 4032 # reserve up to 32B more space than needed
lea rdi, [rsp] # this is weird: mov rdi,rsp is a shorter insn to set up foo's arg
call foo
mov rsp, rbp
pop rbp
ret
gcc (later than 4.8.2) makes significantly larger code, doing a bunch of extra work for no reason; the strangest part is push QWORD PTR [r10-8] to copy some stack memory to another place on the stack. (Check it out on the godbolt link: flip clang to gcc.)