可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:

The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.

回答1:

Some definitions to answer clearly...

NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).

The NEON unit can view the same register bank as:

sixteen 128-bit quadword registers, Q0-Q15

thirty-two 64-bit doubleword registers, D0-D31.

uint16x8_t is a type which requires 128-bit storage thus it needs to be in an quadword register.

ARM NEON Intrinsics has a definition called vector array data type in ARM® C Language Extensions:

... for use in load and store operations, in table-lookup operations, and as the result type of operations that return a pair of vectors.

vzip instruction

... interleaves the elements of two vectors.

vzip Dd, Dm

and has an intrinsic like

uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)

from these we can conclude that uint8x8x2_t is actually a list of two random numbered doubleword registers, because vzip instructions doesn't have any requirement on order of input registers.

Now the answer is...

uint8x8x2_t can contain non-consecutive two dualword registers while uint16x8_t is a data structure consisting of two consecutive dualword registers which first one has an even index (D0-D31 -> Q0-Q15).

Because of this you can't cast vector array data type with two double word registers to a quadword register... easily.

Compiler may be smart enough to assist you, or you can just force conversion however I would check the resulting assembly for correctness as well as performance.

回答2:

You can construct a 128 bit vector from two 64 bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this.

#include <arm_neon.h>

uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t tmp = vzip_u8(a,b);
    uint8x16_t result;
    result = vcombine_u8(tmp.val[0], tmp.val[1]);
    return result;
}

回答3:

I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it is therefore seen as a pointer. Casting and deferencing the pointer works ! [Whereas taking the address of the data raises an "address of temporary" warning.]

uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;

It turns out that this compiles and executes as should (at least in the case I have tried). I haven't looked at the assembly code so I cannot grant it is implemented properly (I mean just keeping the value in a register instead of writing/read to/from memory.)

回答4:

I was facing the same kind of problem, so I introduced a flexible data type.

I can now therefore define the following:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.