How can I apply __attribute__(( aligned(32))) to a

2019-02-18 02:41发布

问题:

In my program I need to apply __attribute__(( aligned(32))) to an int * or float * I tried like this but I'm not sure it will work.

int  *rarray __attribute__(( aligned(32)));

I saw this but didn't find the answer

回答1:

So you want to tell the compiler that your pointers are aligned? e.g. that all callers of this function will pass pointers that are guaranteed to be aligned. Either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign. (If those aren't available, _mm_malloc is one option, but free isn't guaranteed to be safe on _mm_malloc results: you need _mm_free). This allows the compiler to auto-vectorize without making a bunch of bloated code to handle unaligned inputs.

When you manually vectorize with intrinsics, you use _mm256_loadu_si256 or _mm256_load_si256 to inform the compiler whether memory is or isn't aligned. Communicating alignment information is the main point of load/store intrinsics, as opposed to simply dereferencing __m256i pointers.


I don't think there's a portable way to inform the compiler that a pointer points to aligned memory. (C11 / C++11 alignas doesn't seem to be able to do that, see below).

With GNU C __attribute__ syntax, it seems to be necessary to use a typedef to get the attribute to apply to the pointed-to type, rather than to the pointer itself. It's definitely easier to type and easier to read if you declare an aligned_int type or something.

// Only helps GCC, not clang or ICC
typedef __attribute__(( aligned(32)))  int aligned_int;
int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
    int sum = 0;
    for (int i=0 ; i<1024 ; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}

this auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt)

    pxor    xmm0, xmm0
    xor     eax, eax
.L2:
    psubd   xmm0, XMMWORD PTR [rsi+rax]
    paddd   xmm0, XMMWORD PTR [rdi+rax]
    add     rax, 16
    cmp     rax, 4096
    jne     .L2          # end of vector loop

    ...   # horizontal sum with psrldq omitted, see the godbolt link if you're curious
    movd    eax, xmm0
    ret

Without the aligned attribute, you get a big block of scalar intro/outro code, which would be even worse with -march=haswell to make AVX2 code with a wider inner loop.


Clang's normal strategy for unaligned inputs is to use unaligned loads/stores, instead of fully-unrolled intro/outro loops. Without AVX, this means the loads couldn't be folded into memory operands for SSE ALU operations.

The aligned attribute doesn't help clang (tested as recently as clang7.0): it still uses separate movdqu loads. Note that clang's loop is bigger because it defaults to unrolling by 4, whereas gcc doesn't unroll at all without -funroll-loops (which is enabled by -fprofile-use).

But note, this aligned_int typedef only works for GCC itself, not clang or ICC. gcc memory alignment pragma has another example.

__builtin_assume_aligned is noisier syntax, but does work across all compilers that support GNU C extensions.

See How to tell GCC that a pointer argument is always double-word-aligned?


Note that you can't make an array of aligned_int. (see comments for discussion of sizeof(aligned_int), and the fact that it's still 4, not 32). GNU C refuses to treat it as an int-with-padding, so with gcc 5.3:

static aligned_int arr[1024];
// error: alignment of array elements is greater than element size
int tmp = sizeof(arr);

clang-3.8 compiles that, and initializes tmp to 4096. Presumably because it's just totally ignoring the aligned attribute in that context, not doing whatever magic gcc does to have a type that's narrower than its required alignment. (So only every fourth element actually has that alignment.)

The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use-cases. However, as @user3528438 pointed out in comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.


To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, rather than to every element.

In portable C11 and C++11, you can use things like alignas(32) int myarray[1024];. See also Struggling with alignas syntax: it seems to only be useful for aligning things themselves, not declaring that pointers point to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: forcibly aligning a pointer rather than telling the compiler it was already aligned.

// declaring aligned storage for arrays
#ifndef __cplusplus
#include <stdalign.h>   // for C11: defines alignas() using _Alignas()
#endif                  // C++11 defines alignas without any headers

// works for global/static or local  (aka automatic storage)
alignas(32) int foo[1000];      // portable ISO C++11 and ISO C11 syntax


// __attribute__((aligned(32))) int foo[1000];  // older GNU C
// __declspec something  // older MSVC

See the C11 alignas() documentation on cppreference.

CPP macros can be useful to choose between GNU C __attribute__ syntax and MSVC __declspec syntax for alignment if you want portability on older compilers that don't support C11.

e.g. with this code that declares a local array with more alignment than can be assumed for the stack pointer, the compiler has to make space and then AND the stack pointer to get an aligned pointer:

void foo(int *p);
void bar(void) {
  __attribute__((aligned(32))) int a[1000];
  foo (a);
}

compiles to (clang-3.8 -O3 -std=gnu11 for x86-64)

    push    rbp
    mov     rbp, rsp       # stack frame with base pointer since we're doing unpredictable things to rsp
    and     rsp, -32       # 32B-align the stack
    sub     rsp, 4032      # reserve up to 32B more space than needed
    lea     rdi, [rsp]     # this is weird:  mov rdi,rsp  is a shorter insn to set up foo's arg
    call    foo
    mov     rsp, rbp
    pop     rbp
    ret

gcc (later than 4.8.2) makes significantly larger code doing a bunch of extra work for no reason, the strangest being push QWORD PTR [r10-8] to copy some stack memory to another place on the stack. (check it out on the godbolt link: flip clang to gcc).



标签: c gcc simd