I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained example:
#include <iostream>
#include <immintrin.h>
#include <string.h>
struct simd_pack
{
enum { num_vectors = 1 };
__m256i _val[num_vectors];
};
simd_pack load_broken(int8_t *p)
{
simd_pack pack;
for (int i = 0; i < simd_pack::num_vectors; ++i) pack._val[i] = _mm256_loadu_si256(reinterpret_cast<__m256i *>(p + i * 32));
return pack;
}
void store_broken(int8_t *p, simd_pack pack)
{
for (int i = 0; i < simd_pack::num_vectors; ++i) _mm256_storeu_si256(reinterpret_cast<__m256i *>(p + i * 32), pack._val[i]);
}
void test_broken(int8_t *out, int8_t *in1, size_t n)
{
size_t i = 0;
for (; i + 31 < n; i += 32)
{
simd_pack p1 = load_broken(in1 + i);
store_broken(out + i, p1);
}
}
int main()
{
int8_t in_buf[256];
int8_t out_buf[256];
for (size_t i = 0; i < 256; ++i) in_buf[i] = i;
test_broken(out_buf, in_buf, 256);
if (memcmp(in_buf, out_buf, 256)) std::cout << "test_broken() failed!" << std::endl;
return 0;
}
A summary of the above: I have a simple type called simd_pack
that contains one member, an array of one __m256i
value. In my application, there are operators and functions that take these types, but the problem can be illustrated by the above example. Specifically, test_broken()
should read from the in1
array and then just copy its value over to the out
array. Therefore, the call to memcmp()
in main()
should return zero. I compile the above using the following:
clang++-3.6 bug_test.cc -o bug_test -mavx -O3
I find that on optimization levels -O0
and -O1
, the test passes, while on levels -O2
and -O3
, the test fails. I've tried compiling the same file with gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes on all optimization levels.
Taking a closer look at the generated code, here's the assembly generated on optimization level -O3
:
0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>:
400a40: 55 push %rbp
400a41: 48 89 e5 mov %rsp,%rbp
400a44: 48 81 e4 e0 ff ff ff and $0xffffffffffffffe0,%rsp
400a4b: 48 83 ec 40 sub $0x40,%rsp
400a4f: 48 83 fa 20 cmp $0x20,%rdx
400a53: 72 2f jb 400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44>
400a55: 31 c0 xor %eax,%eax
400a57: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
400a5e: 00 00
400a60: c5 fc 10 04 06 vmovups (%rsi,%rax,1),%ymm0
400a65: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
400a6a: c5 fc 28 04 24 vmovaps (%rsp),%ymm0
400a6f: c5 fc 11 04 07 vmovups %ymm0,(%rdi,%rax,1)
400a74: 48 8d 48 20 lea 0x20(%rax),%rcx
400a78: 48 83 c0 3f add $0x3f,%rax
400a7c: 48 39 d0 cmp %rdx,%rax
400a7f: 48 89 c8 mov %rcx,%rax
400a82: 72 dc jb 400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20>
400a84: 48 89 ec mov %rbp,%rsp
400a87: 5d pop %rbp
400a88: c5 f8 77 vzeroupper
400a8b: c3 retq
400a8c: 0f 1f 40 00 nopl 0x0(%rax)
I'll reproduce the key part for emphasis:
400a60: c5 fc 10 04 06 vmovups (%rsi,%rax,1),%ymm0
400a65: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
400a6a: c5 fc 28 04 24 vmovaps (%rsp),%ymm0
400a6f: c5 fc 11 04 07 vmovups %ymm0,(%rdi,%rax,1)
This is kind of head-scratching. It first loads 256 bits into ymm0
using the unaligned move that I asked for, then it stores xmm0
(which only contains the lower 128 bits of the data that was read) to the stack, then immediately reads 256 bits into ymm0
from the stack location that was just written to. The effect is that ymm0
's upper 128 bits (which get written to the output buffer) are garbage, causing the test to fail.
Is there some good reason why this could be happening, other than just a compiler bug? Am I violating some rule by having the simd_pack
type hold an array of __m256i
values? It certainly seems to be related to that; if I change _val
to be a single value instead of an array, then the generated code works as intended. However, my application requires _val
to be an array (its length is dependent upon a C++ template parameter).
Any ideas?