
Question:

I am new to GCC's C vector extensions. I am considering use of them in my project, but their utility is (somewhat) contingent on the ability to efficiently move all elements in a vector one position to the left and store the result in a new vector. How can I do this efficiently (such as in a SIMD-accelerated way)?

So, basically:

OriginalVector = {1, 2, 3, 4, 5, 6, 7, 8}
ShiftedVector = {2, 3, 4, 5, 6, 7, 8, X} (where X can be anything.)

Background information (you can skip this): The purpose of such a transformation is in dealing with matrices where each row is represented with vectors. Specifically, it would enable one to treat ShiftedVector as the upper-left diagonal for the row beneath, and compare all values in one SIMD operation. If there is another way to compare a vector with another vector offset by one element, that would solve the problem too. But I'm assuming not, and that the most efficient way to perform this comparison is to move all the elements leftward and do the comparison 1:1.
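For illustration, the 1:1 comparison described above might look roughly like this with GCC's vector extensions. This is only a sketch: `count_equal_lanes` is a hypothetical helper name, and it assumes the upper row has already been shifted by some means.

```c
#include <assert.h>
#include <string.h>

/* Eight 32-bit ints per vector, as in the example above. */
typedef int v8si __attribute__((vector_size(32)));

/* Hypothetical helper: counts the positions at which the already-shifted
   upper row equals the lower row. Both rows are read through pointers,
   so neither input is modified. */
int count_equal_lanes(const int *shifted_upper, const int *lower)
{
    assert(shifted_upper != NULL && lower != NULL);

    v8si a, b;
    memcpy(&a, shifted_upper, sizeof a);
    memcpy(&b, lower, sizeof b);

    /* Element-wise comparison: each lane becomes -1 if equal, 0 otherwise. */
    v8si eq = (a == b);

    int count = 0;
    for (int i = 0; i < 8; i++)
        count -= eq[i];   /* subtracting -1 adds one per matching lane */
    return count;
}
```

The comparison itself is the single `a == b` expression; the loop only summarizes the resulting mask.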

General stipulations:

The original vector mustn't be harmed in the process
It is fine if I have to use an x86 intrinsic function of some sort, but I don't know which or how
It is fine if I lose the left-most element in the vector and introduce gibberish in the right-most
It is fine if the most efficient method is an unaligned load of the original vector from its second position to end+1, but I still would like to know how to best code this

It seems the bottleneck here is the lack of general information on the process of using the intrinsics. It seems people are either using assembly (which I am no expert in) or auto-vectorization (which doesn't work well here), so vector types are the most logical choice.

Thanks!

Answer 1:

Crawling around in the depths of the manual, I uncovered this bit of tomfoolery:

typedef int v8si __attribute__ ((vector_size (32)));
/* Inside a function body: brace lists are only valid as initializers in C. */
v8si OriginalVector = {1, 2, 3, 4, 5, 6, 7, 8};
v8si masker = {1, 2, 3, 4, 5, 6, 7, 0};
v8si ShiftedVector = __builtin_shuffle(OriginalVector, masker);

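I put a 0 at the end of "masker" arbitrarily; any index from 0 to 7 would work, since the last element of the result is a don't-care. For each position i, __builtin_shuffle takes the element of the original vector at index masker[i] and stores it at position i of the result. The original vector is left untouched.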
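Wrapped up as a function, the same idea might look like the sketch below. `shift_left_one` is a hypothetical helper name, and the Clang branch is my own assumption: as far as I know Clang does not provide `__builtin_shuffle`, and `__builtin_shufflevector` with constant indices is its counterpart.

```c
#include <assert.h>
#include <string.h>

typedef int v8si __attribute__((vector_size(32)));

/* Hypothetical helper: writes src[1..7] followed by src[0] into dst.
   src itself is never modified. */
void shift_left_one(const int src[8], int dst[8])
{
    assert(src != NULL && dst != NULL);

    v8si v, r;
    memcpy(&v, src, sizeof v);

#if defined(__clang__)
    /* Assumption: Clang lacks __builtin_shuffle, so use its
       constant-index counterpart instead. */
    r = __builtin_shufflevector(v, v, 1, 2, 3, 4, 5, 6, 7, 0);
#else
    /* r[i] = v[mask[i]]; the trailing 0 is an arbitrary filler index. */
    const v8si mask = {1, 2, 3, 4, 5, 6, 7, 0};
    r = __builtin_shuffle(v, mask);
#endif

    memcpy(dst, &r, sizeof r);
}
```

Because the mask is a compile-time constant here, the compiler is free to pick whatever permute or shift instruction the target offers.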

Although this works, it may not be the "best" answer. It needs a second vector for the mask, which ties up a register, and it goes through a general-purpose permute that can move any element to any position when all I want is a plain shift.

Yes, we can cache the masker outside the loop or something instead of creating it every time, but I imagine there's some simple "permute left" instruction somewhere which can just slide it over...
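One candidate for such a "slide" instruction, at least for 128-bit vectors, is the SSE2 byte shift `_mm_srli_si128`. Below is a sketch for four ints, assuming an x86 target with SSE2; `shift_left_one_4` is a hypothetical helper name, and the scalar branch is only there so the snippet still builds elsewhere. Note that the 256-bit byte shifts in AVX2 operate within each 128-bit half, so this does not carry over directly to eight ints.

```c
#include <assert.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst = {src[1], src[2], src[3], 0}.
   src is left untouched. */
void shift_left_one_4(const int src[4], int dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__SSE2__)
    /* Unaligned load, so src needs no particular alignment. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* Shift the whole register right by 4 bytes, filling with zeros.
       On little-endian x86 this moves element 1 into element 0,
       i.e. one position "left" in array order. */
    __m128i s = _mm_srli_si128(v, 4);
    _mm_storeu_si128((__m128i *)dst, s);
#else
    /* Scalar fallback for targets without SSE2. */
    int tmp[4] = {src[1], src[2], src[3], 0};
    memcpy(dst, tmp, sizeof tmp);
#endif
}
```

Unlike the shuffle, this needs no mask vector, and the vacated last element is zero rather than arbitrary.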

Answer 2:

The fastest shift is no shift at all (i.e. no move, no copy):

int Data[16] = {
    1, 2, 3, 4, 5, 6, 7, 8,
    0, 0, 0, 0, 0, 0, 0, 0,
};

int* Ptr = Data;
// first shift
Ptr++;
// second shift
Ptr++;
// and so on.

If the algorithm allows that (i.e. the number of shifts is limited and known in advance) it's possible to reserve enough space, and make "shifts" just by incrementing a pointer.
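To get an actual vector out of the shifted pointer, something like the following sketch could be used. `load_window` is a hypothetical helper name. It assumes the padding guarantees that eight ints are readable from the offset, and it relies on memcpy because the shifted address is generally not 32-byte aligned.

```c
#include <assert.h>
#include <string.h>

typedef int v8si __attribute__((vector_size(32)));

/* Hypothetical helper: copies the 8 ints starting at data[offset] into *out.
   len is the total number of ints in data; the reserved padding must make
   offset + 8 <= len hold. */
void load_window(const int *data, size_t len, size_t offset, v8si *out)
{
    assert(data != NULL && out != NULL);
    assert(offset + 8 <= len);

    /* data + offset is usually misaligned for v8si, so it is not cast to
       v8si * and dereferenced. memcpy expresses an unaligned load, which
       optimizing compilers typically lower to a vector load. */
    memcpy(out, data + offset, sizeof *out);
}
```

With the 16-element Data array above, offsets 0 through 8 are valid, so up to eight "shifts" are available before the padding runs out.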