Better way to load vectors from memory. (clang)

I'm writing a test program to get used to Clang's language extensions for OpenCL-style vectors. The code works, but there is one thing I can't get right: I can't figure out how to make clang load a vector from a scalar array cleanly.

At the moment I have to do something like:

byte16 va = (byte16){ argv[1][start], argv[1][start + 1], argv[1][start + 2], 
                      argv[1][start + 3], argv[1][start + 4], argv[1][start + 5], 
                      argv[1][start + 6], argv[1][start + 7], argv[1][start + 8],
                      argv[1][start + 9], argv[1][start + 10], argv[1][start + 11],
                      argv[1][start + 12], argv[1][start + 13], argv[1][start + 14],
                      argv[1][start + 15]};

I would ideally like something like this:

byte16 va = *(byte16 *)(&(argv[1][start]));

I can easily do this with the proper intrinsics for ARM or x86. But the code above, although it compiles, crashes the program at run time.

One reason the crash might occur on x86 is alignment. I do not have clang on my system to reproduce the problem, but I can demonstrate it using GCC as an example.

If you do something like:

/* Define a vector type of 16 characters.  */
typedef char __attribute__ ((vector_size (16))) byte16;

/* Global pointer.  */
char *  foo;

byte16 test ()
{
  return *(byte16 *)&foo[1];
}

Now if you compile it on a vector-capable x86 with:

$  gcc -O3 -march=native -mtune=native   a.c

You will get the following assembly for test:

test:
    movq foo(%rip), %rax
    vmovdqa 1(%rax), %xmm0
    ret

Note that the move is an aligned one (vmovdqa), which is of course wrong, since the address foo + 1 is not guaranteed to be 16-byte aligned. Now, if you inline this function into main, you will have something like:

int main ()
{
  foo = __builtin_malloc (22);
  byte16 x = *(byte16 *)&foo[1];
  return x[0];
}

This version works, and you get an unaligned move instruction. This is kind of a bug, and it doesn't have a very good fix in the compiler, as it would require interprocedural optimisations, the addition of new data structures, etc.

The origin of the problem is that the compiler assumes that vector types are aligned, so when you dereference a pointer to a vector type it is free to use an aligned move. As a workaround in GCC, one can define an unaligned vector type like:

typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16;

And use it to dereference unaligned memory.
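
As a minimal sketch of that workaround, the helper below loads 16 bytes from an arbitrary address through the under-aligned type. The names load_unaligned and load_memcpy are hypothetical, introduced only for illustration. The memcpy variant is not part of the workaround above; it is a commonly used alternative that makes no alignment assumption at all, and compilers typically turn it into a single unaligned move at -O2, though that is worth confirming in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 16 chars per vector, with the natural (16-byte) alignment. */
typedef char __attribute__ ((vector_size (16))) byte16;

/* Same vector, but with the alignment requirement lowered to 1 byte. */
typedef char __attribute__ ((vector_size (16), aligned (1))) unaligned_byte16;

/* Hypothetical helper: load through the under-aligned type, so the
   compiler may not assume that src is 16-byte aligned.  */
static byte16 load_unaligned (const char *src)
{
  assert (src != NULL);
  return *(const unaligned_byte16 *) src;
}

/* Hypothetical helper: alternative that copies the bytes with memcpy,
   which is valid for any source alignment.  */
static byte16 load_memcpy (const char *src)
{
  byte16 v;
  assert (src != NULL);
  memcpy (&v, src, sizeof v);
  return v;
}
```

With either helper, the load from the question would read something like byte16 va = load_unaligned (&argv[1][start]); provided at least 16 bytes are readable at that address.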

I am not sure that you are hitting exactly this problem in your setup, but I would recommend checking for it by inspecting the assembly output from your compiler.
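
Alongside reading the assembly, a quick run-time check can show whether the address being dereferenced meets the alignment the vector type asks for. This is only a sketch: is_byte16_aligned is a hypothetical helper, and it relies on the __alignof__ extension that GCC and Clang both provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Same vector type as above: 16 chars, naturally aligned.  */
typedef char __attribute__ ((vector_size (16))) byte16;

/* Hypothetical helper: returns nonzero if p satisfies the alignment
   that the compiler assumes for byte16.  */
static int is_byte16_aligned (const void *p)
{
  return ((uintptr_t) p % __alignof__ (byte16)) == 0;
}
```

If is_byte16_aligned (&argv[1][start]) returns 0 right before the crashing load, the cast to byte16 * is dereferencing a misaligned address, which is consistent with the explanation above.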