Problem
Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions?
Example
We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}.
Restrictions
Stability. Preservation of the input order is of no significance.
Output. However, any removed values/NULLs must be at the beginning and/or end of the register:
- {null, 1, 2, 3} - OK
- {1, 2, null, null} - OK
- {null, 2, null, null} - OK
- {null, 2, null, 1} - Invalid order
- {null, null, null, null} - Invalid output
It is obviously a bonus if the method is known to always produce one particular output format. Please assume NULL to effectively mean 0 (zero); a scalar reference illustrating the intended result is sketched below.
Generality. Must be able to tolerate the absence of duplicates, and in this case produce an output equivalent to the input register.
Instruction sets. I'm looking for solutions for any or all of: SSE2-SSSE3; SSE4.x; AVX-AVX2
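To make the requirements concrete, here is a scalar reference of the intended result (an illustration only, not a candidate solution): keep the first occurrence of each value at the front of the output and fill the remaining slots with 0, i.e. NULL.

#include <stdint.h>

/* Scalar reference (illustration only): first occurrences packed to the
   front, tail filled with NULL (= 0). */
static void deduplicate4_reference(const uint32_t in[4], uint32_t out[4]) {
    int n = 0;
    for (int i = 0; i < 4; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (in[j] == in[i]) seen = 1;   /* duplicate of an earlier element */
        if (!seen) out[n++] = in[i];        /* first occurrence: keep it */
    }
    while (n < 4) out[n++] = 0;             /* pad the tail with NULL (= 0) */
}
/* Example: in = {3, 9, 2, 9}  ->  out = {3, 9, 2, 0} */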
Solution
The proposed solution always puts all the unique elements in the lower part of the output, ordered by first occurrence; the higher part is zeroed. It is easy to change the placement strategy by modifying the LUT: put the elements in the higher part, or reverse their order.
static __m128i *const lookup_hash = (__m128i*) &lookup_hash_chars[0][0];  // 64-entry LUT of pshufb masks (see the linked full code)

static inline __m128i deduplicate4_ssse3(__m128i abcd) {
    __m128i bcda = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(0, 3, 2, 1));
    __m128i cdab = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(1, 0, 3, 2));
    // compare the input against two of its rotations: together this covers all six pairs
    uint32_t mask1 = _mm_movemask_epi8(_mm_cmpeq_epi32(abcd, bcda));
    uint32_t mask2 = _mm_movemask_epi8(_mm_cmpeq_epi32(abcd, cdab));
    uint32_t maskFull = (mask2 << 16U) + mask1;
    //Note: minimal perfect hash function here
    uint32_t lutIndex = (maskFull * 0x0044CCCEU) >> 26U;
    __m128i shuf = lookup_hash[lutIndex];
    return _mm_shuffle_epi8(abcd, shuf);
}
Full code (with testing) is available here.
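A minimal usage sketch (hypothetical; it needs deduplicate4_ssse3 and the lookup_hash_chars table from the linked full code to build):

#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h>   // SSSE3

int main(void) {
    __m128i in = _mm_setr_epi32(3, 9, 2, 9);                 // R1 = {3, 9, 2, 9}
    uint32_t out[4];
    _mm_storeu_si128((__m128i*)out, deduplicate4_ssse3(in));
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]); // prints: 3 9 2 0
    return 0;
}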
I've also implemented a simple scalar solution using a sorting network of 5 comparators, followed by serial comparison of consecutive elements. I used MSVC 2013 on two processors: Core 2 E4700 (Allendale, 2.6 GHz) and Core i7-3770 (Ivy Bridge, 3.4 GHz). Here are the timings in seconds for 2^29 calls:
// Allendale
SSE: time = 3.340 // ~16.2 cycles (per call)
Scalar: time = 17.218 // ~83.4 cycles (per call)
// Ivy Bridge
SSE: time = 1.203 // ~ 7.6 cycles (per call)
Scalar: time = 11.673 // ~73.9 cycles (per call)
Discussion
Note that the result must consist of two types of elements:
- elements from the input vector,
- zeros.
However, the necessary shuffling mask is determined at runtime, and in a very complex way. All SSE shuffle instructions accept only immediate (i.e. compile-time constant) shuffling masks, except for one: the _mm_shuffle_epi8 intrinsic from SSSE3. In order to obtain the shuffling mask quickly, all the masks are stored in a lookup table, indexed by some bitmask or hash.
To obtain the shuffling mask for a given input vector, it is necessary to collect enough information about its equal elements. Note that knowing which pairs of elements are equal is fully sufficient to determine how to deduplicate them. If we additionally want to sort them, then we also need to know how the distinct elements compare to each other, which increases the amount of information and, consequently, the size of the lookup table. That's why I'll show deduplication without sorting here.
So we have four 32-bit elements in an XMM register, which form six pairs in total. Since we can only compare four elements at a time, we need at least two comparisons. In fact, it is easy to arrange two XMM comparisons so that each pair of elements gets compared at least once. After that we can extract the 16-bit comparison bitmasks with _mm_movemask_epi8 and concatenate them into a single 32-bit integer. Note that each 4-bit block is guaranteed to contain identical bits, and the last two 4-bit blocks are not needed (they correspond to redundant comparisons).
Ideally, we need to extract from this bitmask exactly 6 bits located at compile-time-known positions. That is easy with the _pext_u32 intrinsic from the BMI2 instruction set. The result is an integer in the range [0..63] containing 6 bits, each showing whether the corresponding pair of elements is equal. Then we load a shuffling mask from a precomputed 64-entry lookup table and shuffle our input vector with _mm_shuffle_epi8.
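For completeness, the BMI2 variant might look like the following sketch (lookup_pext is an assumed 64-entry table indexed directly by the 6-bit value; the extraction mask 0x00111111 picks the low bit of each of the six meaningful 4-bit blocks):

#include <stdint.h>
#include <immintrin.h>   // BMI2: _pext_u32

extern const __m128i lookup_pext[64];   // hypothetical 64-entry pshufb LUT

static inline __m128i deduplicate4_bmi2(__m128i abcd) {
    __m128i bcda = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(0, 3, 2, 1));
    __m128i cdab = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(1, 0, 3, 2));
    uint32_t mask1 = _mm_movemask_epi8(_mm_cmpeq_epi32(abcd, bcda));  // a=b, b=c, c=d, d=a
    uint32_t mask2 = _mm_movemask_epi8(_mm_cmpeq_epi32(abcd, cdab));  // a=c, b=d, (duplicates)
    uint32_t maskFull = (mask2 << 16U) + mask1;
    uint32_t lutIndex = _pext_u32(maskFull, 0x00111111U);             // bits 0,4,8,12,16,20
    return _mm_shuffle_epi8(abcd, lookup_pext[lutIndex]);
}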
Unfortunately, BMI2 instructions are quite new (Haswell and later), and I do not have them =) To avoid BMI2, we can try to create a very simple and fast perfect hash function over all the 64 valid bitmasks (recall that the bitmasks are 32-bit). For hash functions of the form f(x) = (a * x) >> (32 - b), it is usually possible to construct a rather small perfect hash with 2x or 3x memory overhead. Since our case is special, it is possible to construct a minimal perfect hash function, so that the lookup table has the minimal 64 entries (i.e. size = 1 KB).
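As a sanity check (not part of the original post), one can enumerate the 64 structurally possible maskFull values (each of the six pair flags sets one 4-bit block, and the two highest blocks mirror the a==c and b==d flags) and verify that the multiply-shift with 0x0044CCCE maps them to 64 distinct indices; the constant effectively routes each pair flag to its own bit of the 6-bit index, without the low bits ever carrying into bit 26:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t used[64] = {0};
    for (uint32_t bits = 0; bits < 64; bits++) {
        uint32_t maskFull = 0;
        for (int i = 0; i < 6; i++)
            if (bits & (1U << i)) maskFull |= 0xFU << (4 * i);
        if (bits & (1U << 4)) maskFull |= 0xFU << 24;   // mirror of a == c
        if (bits & (1U << 5)) maskFull |= 0xFU << 28;   // mirror of b == d
        uint32_t h = (maskFull * 0x0044CCCEU) >> 26U;
        if (used[h]) { printf("collision at %u\n", h); return 1; }
        used[h] = 1;
    }
    printf("0x0044CCCE is a minimal perfect hash over all 64 masks\n");
    return 0;
}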
The same approach is not feasible for 8 elements (e.g. 16-bit integers in an XMM register), because there are 28 pairs of elements, which means the lookup table would need at least 2^28 entries.
Using this approach for 64-bit elements in a YMM register is also problematic. The _mm256_shuffle_epi8 intrinsic does not help, because it simply performs two separate 128-bit shuffles (it never shuffles across lanes). The _mm256_permutevar8x32_epi32 intrinsic performs an arbitrary shuffle of 32-bit blocks, but it cannot insert zeros. In order to use it, you'd have to store the number of unique elements in the LUT too, and then put zeros into the higher part of the register manually.
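A hypothetical, untested sketch of that 64-bit/YMM variant (lookup_perm64 and lookup_cnt64 are assumed precomputed tables, analogous to the 128-bit LUT):

#include <stdint.h>
#include <immintrin.h>   // AVX2

extern const __m256i lookup_perm64[256];   // hypothetical: 8x32-bit lane indices
extern const uint8_t lookup_cnt64[256];    // hypothetical: number of unique elements

static inline __m256i deduplicate4_epi64_avx2(__m256i abcd) {
    __m256i bcda = _mm256_permute4x64_epi64(abcd, _MM_SHUFFLE(0, 3, 2, 1));
    __m256i cdab = _mm256_permute4x64_epi64(abcd, _MM_SHUFFLE(1, 0, 3, 2));
    uint32_t mask1 = _mm256_movemask_pd(_mm256_castsi256_pd(_mm256_cmpeq_epi64(abcd, bcda)));
    uint32_t mask2 = _mm256_movemask_pd(_mm256_castsi256_pd(_mm256_cmpeq_epi64(abcd, cdab)));
    uint32_t lutIndex = 16U * mask2 + mask1;
    // arbitrary cross-lane shuffle of 32-bit blocks; it cannot insert zeros by itself
    __m256i res = _mm256_permutevar8x32_epi32(abcd, lookup_perm64[lutIndex]);
    // zero the higher part manually, based on the stored number of unique elements
    __m256i slot = _mm256_setr_epi32(0, 0, 1, 1, 2, 2, 3, 3);
    __m256i keep = _mm256_cmpgt_epi32(_mm256_set1_epi32((int)lookup_cnt64[lutIndex]), slot);
    return _mm256_and_si256(res, keep);
}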
UPDATE: hash/BMI removed
I have realized that neither BMI2 bit extraction nor a perfect hash function is necessary: we can simply use _mm_movemask_ps to extract one mask bit per 32-bit comparison result. This approach may suffer from minor latency issues because we mix integer and FP domains, but it works faster in practice.
static __m128i *const lookup_direct_offset = lookup_direct - 0xC0U;  // 64-entry LUT; valid indices fall into [0xC0..0xFF]

static inline __m128i deduplicate4_ssse3_direct(__m128i abcd) {
    __m128i bcda = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(0, 3, 2, 1));
    __m128i cdcd = _mm_shuffle_epi32(abcd, _MM_SHUFFLE(3, 2, 3, 2));
    uint32_t mask1 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(abcd, bcda)));
    // the c==c and d==d lanes of this comparison are always true,
    // which is why maskFull never drops below 0xC0
    uint32_t mask2 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(abcd, cdcd)));
    uint32_t maskFull = 16U * mask2 + mask1;
    //Note: use index directly
    uint32_t lutIndex = maskFull;
    __m128i shuf = lookup_direct_offset[lutIndex];
    return _mm_shuffle_epi8(abcd, shuf);
}
The full code is updated too. This results in a minor performance improvement:
// Ivy Bridge
new: Time = 1.038 (782827520) // ~ 6.6 cycles (per call)
old: Time = 1.169 (782827520) // ~ 7.4 cycles (per call)
Naive solution
Crude pseudo-code based on the Max() operation; comments track the data for the first iteration. An intrinsics translation is sketched after the loop.
A = RIN //{3, 9, 2, 9}
For i = 0 .. 3:
    B = Rotate(A, 1)           //{9, 2, 9, 3}
    C = Rotate(A, 2)           //{2, 9, 3, 9}
    D = Rotate(A, 3)           //{9, 3, 9, 2}
    RMAX = Max(A, B)           //{9, 9, 9, 9}
    RMAX = Max(RMAX, C)        //{9, 9, 9, 9}
    RMAX = Max(RMAX, D)        //{9, 9, 9, 9}
    ROUT[i] = RMAX[0]          //ROUT = {9, null, null, null}
    TMP = A
    MASK = Equality(RMAX, TMP) //MASK = {0, 1, 0, 1}
    MASK = Invert(MASK)        //MASK = {1, 0, 1, 0}
    Clear(A)
    A = MoveMasked(TMP, MASK)  //A = {3, null, 2, null}
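For what it's worth, a literal (and unoptimized) intrinsics translation of this loop might look roughly as follows; a sketch only, assuming SSE4.1 for _mm_max_epi32 and relying on the NULL == 0 convention (so it misbehaves if 0 or negative values are legitimate inputs):

#include <stdint.h>
#include <smmintrin.h>   // SSE4.1

static inline __m128i deduplicate4_naive_sse41(__m128i in) {
    uint32_t out[4] = {0, 0, 0, 0};
    __m128i a = in;
    for (int i = 0; i < 4; i++) {
        __m128i b = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 3, 2, 1));   // Rotate(A, 1)
        __m128i c = _mm_shuffle_epi32(a, _MM_SHUFFLE(1, 0, 3, 2));   // Rotate(A, 2)
        __m128i d = _mm_shuffle_epi32(a, _MM_SHUFFLE(2, 1, 0, 3));   // Rotate(A, 3)
        __m128i rmax = _mm_max_epi32(_mm_max_epi32(a, b), _mm_max_epi32(c, d));
        out[i] = (uint32_t)_mm_cvtsi128_si32(rmax);                  // ROUT[i] = RMAX[0]
        __m128i eq = _mm_cmpeq_epi32(rmax, a);                       // MASK = Equality(RMAX, A)
        a = _mm_andnot_si128(eq, a);                                 // clear lanes equal to the max
    }
    return _mm_loadu_si128((const __m128i*)out);
}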
Some thoughts:
A = RIN //{3, 9, 2, 9}
B = Rotate(A, 1) //{9, 2, 9, 3}
C = Rotate(A, 2) //{2, 9, 3, 9}
D = Rotate(A, 3) //{9, 3, 9, 2}
maskA = cmpeq(A,B) //{0, 0, 0, 0}
maskB = cmpeq(A,C) //{0, -1, 0, -1}
maskC = cmpeq(A,D) //{0, 0, 0, 0}
indexA = horSum( { 1,2,4,8 } * maskA ) // 0
indexB = horSum( { 1,2,4,8 } * maskB ) // 10
indexC = horSum( { 1,2,4,8 } * maskC ) // 0
// The problem is this function here
// Of the 4096 possible indexABC only a subset will occur
// Based on an enumeration of all possible indexes a pattern
// for a lookup table could possibly be found (a generation sketch follows below)
shuffleConst = lookupShuffle( indexA, indexB, indexC )
shuffle(A, shuffleConst)
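One possible way to build such a table offline (or at startup), sketched here against the pair-flag layout of the deduplicate4_ssse3_direct variant above (mask1 = {a=b, b=c, c=d, d=a}, mask2 = {a=c, b=d, 1, 1}); lookup_direct_chars is a hypothetical name, and lookup_direct would then point at &lookup_direct_chars[0][0]:

#include <stdint.h>
#include <string.h>

/* Hypothetical generator for the 64-entry table used by
   deduplicate4_ssse3_direct (valid indices 0xC0..0xFF). Each entry is a
   16-byte pshufb mask that gathers first occurrences and zeroes the tail. */
static uint8_t lookup_direct_chars[64][16];

static void build_lookup_direct(void) {
    for (uint32_t idx = 0xC0; idx <= 0xFF; idx++) {
        uint32_t m1 = idx & 0xFU, m2 = idx >> 4;
        int eq[4][4] = {{0}};                    /* eq[i][j]: element i == element j */
        eq[0][1] = eq[1][0] = (m1 >> 0) & 1;     /* a == b */
        eq[1][2] = eq[2][1] = (m1 >> 1) & 1;     /* b == c */
        eq[2][3] = eq[3][2] = (m1 >> 2) & 1;     /* c == d */
        eq[3][0] = eq[0][3] = (m1 >> 3) & 1;     /* d == a */
        eq[0][2] = eq[2][0] = (m2 >> 0) & 1;     /* a == c */
        eq[1][3] = eq[3][1] = (m2 >> 1) & 1;     /* b == d */
        uint8_t *shuf = lookup_direct_chars[idx - 0xC0];
        memset(shuf, 0x80, 16);                  /* 0x80 makes pshufb emit a zero byte */
        int n = 0;
        for (int i = 0; i < 4; i++) {            /* keep first occurrences, in order */
            int dup = 0;
            for (int j = 0; j < i; j++)
                if (eq[i][j]) dup = 1;
            if (dup) continue;
            for (int b = 0; b < 4; b++)
                shuf[4 * n + b] = (uint8_t)(4 * i + b);
            n++;
        }
    }
}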