I made my first approach to vectorization intrinsics with SSE, where there is basically only one integer data type, __m128i. Switching to Neon I found the data types and function prototypes to be much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors of 8 unsigned char each), uint32x4_t (a vector of 4 uint32_t), etc.
First I was enthusiastic (it is much easier to find the exact function operating on the desired data type), but then I saw what a mess it was when I wanted to treat the same data in different ways. Converting with the specific casting operators everywhere would take me forever. The problem is also addressed here. I then came up with the idea of a union encapsulated in a struct, together with some casting and assignment operators.
#include <arm_neon.h>
#include <cstdint>

struct uint_128bit_t {
    union {
        uint8x16_t  uint8x16;
        uint16x8_t  uint16x8;
        uint32x4_t  uint32x4;
        uint8x8x2_t uint8x8x2;
        uint8_t  uint8_array[16] __attribute__((aligned(16)));
        uint16_t uint16_array[8] __attribute__((aligned(16)));
        uint32_t uint32_array[4] __attribute__((aligned(16)));
    };
    operator uint8x16_t&()  { return uint8x16; }
    operator uint16x8_t&()  { return uint16x8; }
    operator uint32x4_t&()  { return uint32x4; }
    operator uint8x8x2_t&() { return uint8x8x2; }
    uint8x16_t&  operator=(const uint8x16_t& in)  { uint8x16 = in;  return uint8x16; }
    uint8x8x2_t& operator=(const uint8x8x2_t& in) { uint8x8x2 = in; return uint8x8x2; }
};
This approach works for me: I can use a variable of type uint_128bit_t as argument and output with different Neon intrinsics, e.g. vshlq_n_u32, vuzp_u8, vget_low_u8 (in the last case just as input). And I can extend it with more data types if I need to.
Note: the arrays are there to easily print the content of a variable.
Is this a correct way of proceeding?
Is there any hidden flaw?
Have I reinvented the wheel?
(Is the aligned attribute necessary?)
According to the C++ Standard, this data type is nearly useless (and certainly so for the purpose you intend). That's because reading from an inactive member of a union is undefined behavior.
It is possible, however, that your compiler promises to make this work. But since you haven't asked about any particular compiler, it is impossible to comment further on that.
Since the initially proposed method has undefined behaviour in C++, I have implemented something like this:
#include <cstring>                  // memcpy
#include <boost/static_assert.hpp>  // BOOST_STATIC_ASSERT_MSG

template <typename T>
struct NeonVectorType {
private:
    T data;
public:
    template <typename U>
    operator U() {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to convert to data type of different size");
        U u;
        std::memcpy(&u, &data, sizeof u);
        return u;
    }
    template <typename U>
    NeonVectorType<T>& operator=(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to copy from data type of different size");
        std::memcpy(&data, &in, sizeof data);
        return *this;
    }
};
Then:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
The use of memcpy is discussed here (and here), and avoids breaking the strict aliasing rule. Note that in general it gets optimized away.
If you look at the edit history, I had implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.
Finally, to print the content of the variable one could use a function like this.
If you try to avoid proper casting through various data-structure hackery, you'll end up shuffling memory/words around, which will kill any performance you're hoping to get from NEON.
You can probably cast quad registers down to double registers easily, but the other way around may not be possible.
Everything boils down to this: each instruction has only a few bits to index registers. A quad register Qn is really the consecutive pair of double registers D(2n) and D(2n+1), so only n is encoded in the instruction and the pairing is implicit for the core. If at some point in your code you try to cast two double registers into a quad, they may not form such a consecutive pair, forcing the compiler to shuffle registers through the stack and back to get a consecutive layout.
I think it is still the same answer in different words https://stackoverflow.com/a/13734838/1163019
NEON instructions are designed for streaming: you load big chunks from memory, process them, then store back what you want. This should all be very simple mechanics; otherwise you'll lose the extra performance NEON offers, which will make people ask why you're trying to use it in the first place, making life harder for yourself.
Think of NEON as immutable value types and operations.