Aligned and unaligned memory access with AVX/AVX2

2020-06-11 14:11发布

问题:

According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

vaddps ymm0,ymm0,YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

vmovaps ymm0,YMMWORD PTR [rax]

the load address has to be aligned (to multiples of 32), otherwise an exception is raised.

What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1L << 26)
#define OFFSET 1

int main() {
  float *data;
  assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
  for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
  float res[8]  __attribute__ ((aligned(32)));
  __m256 sum = _mm256_setzero_ps(), elem;
  for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
    elem = _mm256_load_ps(d);
    // sum = _mm256_add_ps(elem, elem);
    sum = _mm256_add_ps(sum, elem);
  }
  _mm256_store_ps(res, sum);
  for (int i = 0; i < 8; i++) printf("%g ", res[i]); printf("\n");
  return 0;
}

(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)

I compile the code with

g++ -Wall -O3 -march=native -o memtest memtest.C

on a CPU with AVX. If I check the code generated by g++ by using

objdump -S -M intel-mnemonic memtest | more

I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:

vaddps ymm0,ymm0,YMMWORD PTR [rax]

The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.

If I uncomment the line with the second addition intrinsic, the compiler cannot fuse the load and the addition since vaddps can only have a single memory source operand, and generates:

vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0

And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)

This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.

My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?

回答1:

There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.

In previous version of GCC I was able to control the folding to some degree using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). I mean for example in the function AddDot4x4_vec_block_8wide here the loads are folded

vmulps  ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps  ymm8, ymm9, ymm8

However in a previous verison of GCC the loads were not folded:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

The correct solution is, obviously, to only used aligned loads when you know the data is aligned and if you really want to explicitly control the folding use assembly.



回答2:

In addition to Z boson's answer I can tell that the compiler is rightfully doing load folding because it assumes the memory region is aligned (because of __attribute__ ((aligned(32))) marking the array). In runtime, however, that attribute does not work for values on the stack because the stack is only 16-byte aligned (see this bug). You can try forcing the compiler to realign the stack to 32 bytes upon entry in main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here) but it will incur a performance overhead.



标签: gcc avx avx2