Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef std::array<double, my_elements> Vec alignas(32);
#endif
    void fun1(const Vec&);
    Vec v1{{}};
};

void Foo::fun1(const Vec& __restrict__ v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}
Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:
vmovapd ymm0, YMMWORD PTR [rdi]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi]
vmovapd YMMWORD PTR [rdi], ymm0
vmovapd ymm0, YMMWORD PTR [rdi+32]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovapd YMMWORD PTR [rdi+32], ymm0
vzeroupper
That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:
mov rax, rdi
shr rax, 3
neg rax
and eax, 3
je .L7
The code generated in this case (using std::array instead of a plain C array) seems to check the alignment of the input array at run time, even though the typedef specifies 32-byte alignment.
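For reference, the five instructions above appear to compute how many leading doubles would have to be handled one at a time before the pointer in rdi reaches a 32-byte boundary. Here is a minimal sketch of that arithmetic in C++. The name peel_count is hypothetical, and the mapping to the assembly is my reading of it, not something taken from GCC's documentation:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the prologue: mov / shr 3 / neg / and 3.
// Returns how many doubles precede the next 32-byte boundary at or after p.
unsigned peel_count(const double* p)
{
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % sizeof(double) == 0);          // the shift assumes 8-byte alignment
    const std::uintptr_t index = addr >> 3;      // shr rax, 3
    const std::uintptr_t negated = 0 - index;    // neg rax (unsigned wrap-around)
    return static_cast<unsigned>(negated & 3);   // and eax, 3
}
```

A result of 0 would correspond to the je .L7 branch being taken. For a pointer that really is 32-byte aligned the result is always 0, which is why the check looks redundant here.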
It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
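To make the alignment claim concrete, here is a small sketch that checks it at run time. AlignedVec and check_layout are hypothetical names. The sketch puts alignas on a member rather than on a typedef, and it assumes the usual standard library layout in which the element storage begins at the address of the std::array object:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t my_elements = 8;

// Hypothetical stand-in for Foo::Vec: a 32-byte-aligned std::array object.
struct AlignedVec
{
    alignas(32) std::array<double, my_elements> values{};
};

// Checks that the elements start where the array object starts and that
// this address is 32-byte aligned.
void check_layout(const AlignedVec& v)
{
    const void* object_addr = static_cast<const void*>(&v.values);
    const void* data_addr = static_cast<const void*>(v.values.data());
    assert(object_addr == data_addr);
    assert(reinterpret_cast<std::uintptr_t>(data_addr) % 32 == 0);
}
```

If both assertions hold, the alignment of the object carries over to its first element, so the run-time check in the generated code adds nothing.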
Is there something simple I'm missing that would fix this? So far I have only come up with an ugly hack:
void Foo::fun2(const Vec& __restrict__ v2)
{
    typedef double V2 alignas(Foo::Vec);
    const V2* v2a = static_cast<const V2*>(&v2[0]);
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2a[i];
    }
}
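Another variant that might be worth trying states the alignment through the __builtin_assume_aligned builtin, which GCC and Clang both provide. This is only a sketch: add_assume_aligned is a hypothetical free function, not part of Foo, and I have not verified that it restores the short code on GCC 6 or 7.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t my_elements = 8;
typedef std::array<double, my_elements> Vec;

// Hypothetical free-function variant of fun2.
// Precondition: dst and src are distinct, 32-byte-aligned arrays.
void add_assume_aligned(Vec& dst, const Vec& src)
{
    // Debug-only check of the promise made to the compiler below.
    assert(reinterpret_cast<std::uintptr_t>(dst.data()) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src.data()) % 32 == 0);

    // __builtin_assume_aligned returns void*, hence the casts.
    double* __restrict__ d =
        static_cast<double*>(__builtin_assume_aligned(dst.data(), 32));
    const double* __restrict__ s =
        static_cast<const double*>(__builtin_assume_aligned(src.data(), 32));

    for (std::size_t i = 0; i < my_elements; ++i)
    {
        d[i] += s[i];
    }
}
```

The caller is responsible for the alignment, for example by declaring the arrays as alignas(32) Vec a{}. Passing an array that is not 32-byte aligned would be undefined behavior, so this is no less of a hack than the typedef trick.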
Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.
You can see it live here: https://godbolt.org/g/IXIOst