Is this incorrect code generation with arrays of _

I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained example:

#include <iostream>
#include <immintrin.h>
#include <string.h>

struct simd_pack
{
    enum { num_vectors = 1 };
    __m256i _val[num_vectors];
};

simd_pack load_broken(int8_t *p)
{
    simd_pack pack;
    for (int i = 0; i < simd_pack::num_vectors; ++i) pack._val[i] = _mm256_loadu_si256(reinterpret_cast<__m256i *>(p + i * 32));
    return pack;
}

void store_broken(int8_t *p, simd_pack pack)
{
    for (int i = 0; i < simd_pack::num_vectors; ++i) _mm256_storeu_si256(reinterpret_cast<__m256i *>(p + i * 32), pack._val[i]);    
}

void test_broken(int8_t *out, int8_t *in1, size_t n)
{
    size_t i = 0;
    for (; i + 31 < n; i += 32)
    {
        simd_pack p1 = load_broken(in1 + i);
        store_broken(out + i, p1);
    }   
}

int main()
{
    int8_t in_buf[256];
    int8_t out_buf[256];
    for (size_t i = 0; i < 256; ++i) in_buf[i] = i;

    test_broken(out_buf, in_buf, 256);
    if (memcmp(in_buf, out_buf, 256)) std::cout << "test_broken() failed!" << std::endl;    

    return 0;
}

A summary of the above: I have a simple type called simd_pack that contains one member, an array of one __m256i value. In my application, there are operators and functions that take these types, but the problem can be illustrated by the above example. Specifically, test_broken() should read from the in1 array and then just copy its value over to the out array. Therefore, the call to memcmp() in main() should return zero. I compile the above using the following:

clang++-3.6 bug_test.cc -o bug_test -mavx -O3

I find that on optimization levels -O0 and -O1, the test passes, while on levels -O2 and -O3, the test fails. I've tried compiling the same file with gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes on all optimization levels.

Taking a closer look at the generated code, here's the assembly generated on optimization level -O3:

0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>:
  400a40:       55                      push   %rbp
  400a41:       48 89 e5                mov    %rsp,%rbp
  400a44:       48 81 e4 e0 ff ff ff    and    $0xffffffffffffffe0,%rsp
  400a4b:       48 83 ec 40             sub    $0x40,%rsp
  400a4f:       48 83 fa 20             cmp    $0x20,%rdx
  400a53:       72 2f                   jb     400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44>
  400a55:       31 c0                   xor    %eax,%eax
  400a57:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  400a5e:       00 00 
  400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
  400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
  400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
  400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)
  400a74:       48 8d 48 20             lea    0x20(%rax),%rcx
  400a78:       48 83 c0 3f             add    $0x3f,%rax
  400a7c:       48 39 d0                cmp    %rdx,%rax
  400a7f:       48 89 c8                mov    %rcx,%rax
  400a82:       72 dc                   jb     400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20>
  400a84:       48 89 ec                mov    %rbp,%rsp
  400a87:       5d                      pop    %rbp
  400a88:       c5 f8 77                vzeroupper 
  400a8b:       c3                      retq   
  400a8c:       0f 1f 40 00             nopl   0x0(%rax)

I'll reproduce the key part for emphasis:

  400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
  400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
  400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
  400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)

This is kind of head-scratching. It first loads 256 bits into ymm0 using the unaligned move that I asked for, then it stores xmm0 (which only contains the lower 128 bits of the data that was read) to the stack, then immediately reads 256 bits into ymm0 from the stack location that was just written to. The effect is that ymm0's upper 128 bits (which get written to the output buffer) are garbage, causing the test to fail.

Is there some good reason why this could be happening, other than just a compiler bug? Am I violating some rule by having the simd_pack type hold an array of __m256i values? It certainly seems to be related to that; if I change _val to be a single value instead of an array, then the generated code works as intended. However, my application requires _val to be an array (its length is dependent upon a C++ template parameter).

Any ideas?

This is a bug in clang. The fact that it happened at -O0 is a good clue that the bug is in the front-end, and in this case, it's a dark corner of the x86-64 ABI implementation related to handling of a struct that contains a vector array of exactly size 1!

The bug has been present for years, but this is the first time that anyone has hit it, noticed it, and reported it. Thanks!

http://llvm.org/bugs/show_bug.cgi?id=22563