The question:
What is the most efficient sequence to generate a stride-3 gather of 32-bit elements from memory? If the memory is arranged as:
MEM = R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 ...
We want to obtain three YMM registers where:
YMM0 = R0 R1 R2 R3 R4 R5 R6 R7
YMM1 = G0 G1 G2 G3 G4 G5 G6 G7
YMM2 = B0 B1 B2 B3 B4 B5 B6 B7
Motivation and discussion
The scalar C code is something like
template <typename T>
T Process(const T* Input) {
  T Result = 0;
  for (int i = 0; i < 4096; ++i) {
    T R = Input[3*i];
    T G = Input[3*i+1];
    T B = Input[3*i+2];
    Result += some_parallelizable_algorithm<T>(R, G, B);
  }
  return Result;
}
Let's say that some_parallelizable_algorithm was written in intrinsics and was tuned to the fastest implementation possible:
template <typename T>
__m256i some_parallelizable_algorithm(__m256i R, __m256i G, __m256i B);
So the vector implementation for T=int32_t can be something like:
template <>
int32_t Process<int32_t>(const int32_t* Input) {
  // Gather indices: every third 32-bit element. The gather scale argument
  // must be 1, 2, 4 or 8 bytes, so the stride of 3 elements goes into the
  // index vector and the element size (4 bytes) into the scale.
  __m256i Step = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
  __m256i Result = _mm256_setzero_si256();
  for (int i = 0; i < 4096; i += 8) {
    // R = R0 R1 R2 R3 R4 R5 R6 R7
    __m256i R = _mm256_i32gather_epi32(Input + 3*i, Step, 4);
    // G = G0 G1 G2 G3 G4 G5 G6 G7
    __m256i G = _mm256_i32gather_epi32(Input + 3*i + 1, Step, 4);
    // B = B0 B1 B2 B3 B4 B5 B6 B7
    __m256i B = _mm256_i32gather_epi32(Input + 3*i + 2, Step, 4);
    Result = _mm256_add_epi32(Result,
        some_parallelizable_algorithm<int32_t>(R, G, B));
  }
  // Here should be the less interesting part:
  // Perform a reduction on Result and return the result
}
First, this can be done because there are gather instructions for 32-bit elements, but there are none for 16-bit elements or 8-bit elements. Second, and more importantly, the gather instruction above should be entirely avoided for performance reasons. It is probably more efficient to use contiguous wide loads and shuffle the loaded values to obtain the R, G and B vectors.
template <>
int32_t Process<int32_t>(const int32_t* Input) {
  __m256i Result = _mm256_setzero_si256();
  for (int i = 0; i < 4096; i += 8) {
    // Three contiguous 256-bit loads cover 8 pixels (24 x int32).
    __m256i Ld0 = _mm256_lddqu_si256((const __m256i*)(Input + 3*i) + 0);
    __m256i Ld1 = _mm256_lddqu_si256((const __m256i*)(Input + 3*i) + 1);
    __m256i Ld2 = _mm256_lddqu_si256((const __m256i*)(Input + 3*i) + 2);
    __m256i R = ???
    __m256i G = ???
    __m256i B = ???
    Result = _mm256_add_epi32(Result,
        some_parallelizable_algorithm<int32_t>(R, G, B));
  }
  // Here should be the less interesting part:
  // Perform a reduction on Result and return the result
}
It seems that for power-of-2 strides (2, 4, ...) there are known methods using UNPCKL/UNPCKH, but for stride-3 accesses I could not find any references.
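For reference, a minimal sketch of that stride-2 unpack trick (my own illustration, not taken from a particular source): memory holds a0 b0 a1 b1 a2 b2 a3 b3 and two rounds of unpacks separate the streams.

#include <immintrin.h>
#include <stdint.h>

static inline void Deinterleave2x4_epi32(const int32_t* p, __m128i* A, __m128i* B) {
  __m128i v0 = _mm_loadu_si128((const __m128i*)p);       // a0 b0 a1 b1
  __m128i v1 = _mm_loadu_si128((const __m128i*)(p + 4)); // a2 b2 a3 b3
  __m128i lo = _mm_unpacklo_epi32(v0, v1);               // a0 a2 b0 b2
  __m128i hi = _mm_unpackhi_epi32(v0, v1);               // a1 a3 b1 b3
  *A = _mm_unpacklo_epi32(lo, hi);                       // a0 a1 a2 a3
  *B = _mm_unpackhi_epi32(lo, hi);                       // b0 b1 b2 b3
}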
I am interested in solving this for T=int32_t, T=int16_t and T=int8_t, but to remain focused let's only discuss the first case.
The 8-bit integer case.
As already mentioned in the comments above, two-input shuffle instructions such as vshufps don't exist with 8-bit granularity. Hence, the 8-bit solution differs a bit from the 32-bit solution. Two different approaches are described below.
A straightforward approach is to group the 8-bit integers color by color (R, G, B) with six vpblendvb instructions, followed by a vpshufb permutation.
Unfortunately, the vpblendvb instruction is often relatively slow: on Intel Skylake vpblendvb has a throughput of one per cycle, and on AMD Ryzen and Intel Haswell the throughput is only one per two cycles. Skylake-X has a fast byte blend, vpblendmb (throughput three per cycle for 256-bit vectors), although on Skylake-X one might be more interested in a solution that works with 512-bit vectors instead of 256-bit.
An alternative is to combine vpshufb with vshufps, as suggested in @Peter Cordes' comments above, loading the data in 12-byte chunks (see the sketch below). Altogether more instructions are needed than in the first solution. Nevertheless, the performance of this second solution is probably better than the first, depending on the surrounding code and the microarchitecture.
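The original code is not reproduced here; as a rough illustration of the vpshufb + vshufps idea, a sketch could look as follows. The helper name, the 32-pixels-per-call framing and the 4-byte over-read at the end of the block are my own choices, not taken from the original answer.

#include <immintrin.h>
#include <stdint.h>

// Deinterleave 32 RGB pixels (96 bytes) starting at p into 32 R, 32 G and
// 32 B bytes. Note: the last 16-byte load reads 4 bytes past the 96-byte
// block, so the caller must guarantee that this over-read is safe.
static inline void Deinterleave3x32_epi8(const uint8_t* p,
                                         __m256i* R, __m256i* G, __m256i* B) {
  // Per-lane byte shuffle: turns a 12-byte chunk R0 G0 B0 R1 G1 B1 R2 G2 B2
  // R3 G3 B3 (plus 4 ignored bytes) into R0 R1 R2 R3 G0 G1 G2 G3 B0 B1 B2 B3 0 0 0 0.
  const __m256i grp = _mm256_setr_epi8(
      0, 3, 6, 9,  1, 4, 7, 10,  2, 5, 8, 11,  -1, -1, -1, -1,
      0, 3, 6, 9,  1, 4, 7, 10,  2, 5, 8, 11,  -1, -1, -1, -1);

  // Load 12-byte chunks; chunk k holds pixels 4k..4k+3.
  // t[k] = (chunk k in the low lane, chunk k+4 in the high lane).
  __m256i t[4];
  for (int k = 0; k < 4; ++k) {
    __m128i lo = _mm_loadu_si128((const __m128i*)(p + 12*k));
    __m128i hi = _mm_loadu_si128((const __m128i*)(p + 12*(k + 4)));
    t[k] = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    t[k] = _mm256_shuffle_epi8(t[k], grp); // group color by color within the chunk
  }
  // Each t[k] now holds, per 128-bit lane, the dwords [4xR, 4xG, 4xB, junk].
  // vshufps (via float casts) merges dwords from two registers at a time.
  __m256 rg01 = _mm256_shuffle_ps(_mm256_castsi256_ps(t[0]),
                                  _mm256_castsi256_ps(t[1]), _MM_SHUFFLE(1, 0, 1, 0));
  __m256 rg23 = _mm256_shuffle_ps(_mm256_castsi256_ps(t[2]),
                                  _mm256_castsi256_ps(t[3]), _MM_SHUFFLE(1, 0, 1, 0));
  __m256 b01  = _mm256_shuffle_ps(_mm256_castsi256_ps(t[0]),
                                  _mm256_castsi256_ps(t[1]), _MM_SHUFFLE(3, 2, 3, 2));
  __m256 b23  = _mm256_shuffle_ps(_mm256_castsi256_ps(t[2]),
                                  _mm256_castsi256_ps(t[3]), _MM_SHUFFLE(3, 2, 3, 2));
  *R = _mm256_castps_si256(_mm256_shuffle_ps(rg01, rg23, _MM_SHUFFLE(2, 0, 2, 0)));
  *G = _mm256_castps_si256(_mm256_shuffle_ps(rg01, rg23, _MM_SHUFFLE(3, 1, 3, 1)));
  *B = _mm256_castps_si256(_mm256_shuffle_ps(b01,  b23,  _MM_SHUFFLE(2, 0, 2, 0)));
  // R = R0..R15 | R16..R31, and likewise for G and B (pixels in order).
}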
It is easy to adapt the ideas of these methods to the 16-bit case.
This article from Intel describes how to do exactly the 3x8 case that you want.
That article covers the float case. If you want int32, you'll need to cast the outputs, since there is no integer version of _mm256_shuffle_ps().
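The article's solution boils down to six 128-bit loads followed by five shuffles. The sketch below is reconstructed from that approach (not a verbatim copy); the wrapper function and the trailing casts to __m256i for the integer case are additions of mine.

#include <immintrin.h>

// Stride-3 deinterleave of 8 float triplets starting at p; for int32 data
// the same sequence works, reinterpreting the pointer and the results.
static inline void Gather3x8(const float* p, __m256i* R, __m256i* G, __m256i* B) {
  // Six 128-bit loads: three lower halves, then three upper halves.
  __m256 m03 = _mm256_castps128_ps256(_mm_loadu_ps(p +  0)); // x0 y0 z0 x1
  __m256 m14 = _mm256_castps128_ps256(_mm_loadu_ps(p +  4)); // y1 z1 x2 y2
  __m256 m25 = _mm256_castps128_ps256(_mm_loadu_ps(p +  8)); // z2 x3 y3 z3
  m03 = _mm256_insertf128_ps(m03, _mm_loadu_ps(p + 12), 1);  // ... x4 y4 z4 x5
  m14 = _mm256_insertf128_ps(m14, _mm_loadu_ps(p + 16), 1);  // ... y5 z5 x6 y6
  m25 = _mm256_insertf128_ps(m25, _mm_loadu_ps(p + 20), 1);  // ... z6 x7 y7 z7

  // Five shuffles separate the three channels.
  __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's
  __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
  __m256 x  = _mm256_shuffle_ps(m03, xy,  _MM_SHUFFLE(2, 0, 3, 0)); // x0 x1 x2 x3 x4 x5 x6 x7
  __m256 y  = _mm256_shuffle_ps(yz,  xy,  _MM_SHUFFLE(3, 1, 2, 0)); // y0 y1 y2 y3 y4 y5 y6 y7
  __m256 z  = _mm256_shuffle_ps(yz,  m25, _MM_SHUFFLE(3, 0, 3, 1)); // z0 z1 z2 z3 z4 z5 z6 z7

  // The casts generate no instructions; they only reinterpret the registers.
  *R = _mm256_castps_si256(x);
  *G = _mm256_castps_si256(y);
  *B = _mm256_castps_si256(z);
}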
So this is 11 instructions. (6 loads, 5 shuffles)
In the general case, it's possible to do an S x W transpose in O(S*log(W)) instructions, where S is the stride and W is the SIMD width.
Assuming the existence of 2-vector permutes and half-vector insert-loads, the count for an S x W load-transpose comes out to roughly S * (lg(W) + 1) instructions (the 3 x 16 example below matches this exactly: 3 * (4 + 1) = 15).
This ignores reg-reg moves. For degenerate cases like the 3 x 4, it may be possible to do better.
Here's the 3 x 16 load-transpose with AVX512 (6 loads, 3 shuffles, 6 blends):
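The original code block is not reproduced here; the following is one sketch that reaches those counts (a reconstruction, not necessarily the original answer's exact sequence). The key observation is that, for each output channel, the needed elements of the three loaded vectors all sit in mutually distinct dword positions, so two mask blends collect them per channel and one cross-lane permute restores pixel order.

#include <immintrin.h>
#include <stdint.h>

// 3 x 16 load-transpose: p points to 48 contiguous int32 values
// R0 G0 B0 R1 G1 B1 ... R15 G15 B15. Constant setup is not counted.
static inline void LoadTranspose3x16(const int32_t* p,
                                     __m512i* R, __m512i* G, __m512i* B) {
  // 6 half-vector loads: a/b/c each hold two 8-dword pieces.
  __m512i a = _mm512_castsi256_si512(_mm256_loadu_si256((const __m256i*)(p +  0)));
  __m512i b = _mm512_castsi256_si512(_mm256_loadu_si256((const __m256i*)(p +  8)));
  __m512i c = _mm512_castsi256_si512(_mm256_loadu_si256((const __m256i*)(p + 16)));
  a = _mm512_inserti64x4(a, _mm256_loadu_si256((const __m256i*)(p + 24)), 1);
  b = _mm512_inserti64x4(b, _mm256_loadu_si256((const __m256i*)(p + 32)), 1);
  c = _mm512_inserti64x4(c, _mm256_loadu_si256((const __m256i*)(p + 40)), 1);
  // Per 256-bit half (pixel indices +8 in the upper half):
  //   a = R0 G0 B0 R1 G1 B1 R2 G2
  //   b = B2 R3 G3 B3 R4 G4 B4 R5
  //   c = G5 B5 R6 G6 B6 R7 G7 B7

  // 6 blends: collect all 8 elements of each channel per half, out of order.
  __m512i tR = _mm512_mask_blend_epi32(0x2424,
                _mm512_mask_blend_epi32(0x9292, a, b), c); // R0 R3 R6 R1 R4 R7 R2 R5
  __m512i tG = _mm512_mask_blend_epi32(0x4949,
                _mm512_mask_blend_epi32(0x2424, a, b), c); // G5 G0 G3 G6 G1 G4 G7 G2
  __m512i tB = _mm512_mask_blend_epi32(0x9292,
                _mm512_mask_blend_epi32(0x4949, a, b), c); // B2 B5 B0 B3 B6 B1 B4 B7

  // 3 shuffles: one cross-lane permute per channel restores pixel order.
  const __m512i ixR = _mm512_setr_epi32(0, 3, 6, 1, 4, 7, 2, 5,
                                        8, 11, 14, 9, 12, 15, 10, 13);
  const __m512i ixG = _mm512_setr_epi32(1, 4, 7, 2, 5, 0, 3, 6,
                                        9, 12, 15, 10, 13, 8, 11, 14);
  const __m512i ixB = _mm512_setr_epi32(2, 5, 0, 3, 6, 1, 4, 7,
                                        10, 13, 8, 11, 14, 9, 12, 15);
  *R = _mm512_permutexvar_epi32(ixR, tR); // R0 .. R15
  *G = _mm512_permutexvar_epi32(ixG, tG); // G0 .. G15
  *B = _mm512_permutexvar_epi32(ixB, tB); // B0 .. B15
}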
The inverse 3 x 16 transpose-store will be left as an exercise to the reader.
The pattern is not at all trivial to see since S = 3 is somewhat degenerate. But if you can see the pattern, you'll be able to generalize this to any odd integer S as well as any power-of-two W.