I am having difficulty writing code for this seemingly easy problem.
Given a vector of packed 8-bit integers, substitute one byte value with another wherever it is present.
For instance, I want to substitute 0x06 with 0x01, so I can do the following, with res as the input, to find 0x06:
// Bytes to be manipulated
res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06);
// Target value and substitution
val = _mm_set1_epi8(0x06);
sub = _mm_set1_epi8(0x01);
// Find the target
sse = _mm_cmpeq_epi8(res, val);
// Isolate target
sse = _mm_and_si128(res, sse);
// Isolate remaining bytes
adj = _mm_andnot_si128(sse, res);
Now I don't know how to proceed to OR those two parts: I need to remove the target and substitute it with the replacement byte.
What SIMD instruction am I missing here?
As with other questions, I am limited to AVX; I have no better processor.
What you essentially need to do is set all bytes of the input that you want to substitute to zero, then set all other bytes of the substitution vector to zero, and OR the two results. You already have a mask for that: the result of _mm_cmpeq_epi8. Overall, this takes one ANDNOT (the input with the inverted mask), one AND (the substitution with the mask), and one OR. Since this combination of and/andnot/or is very common, SSE4.1 introduced an instruction which (essentially) combines these into one: pblendvb, available as the intrinsic _mm_blendv_epi8.
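The two variants described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function names replace_bytes_and_or and replace_bytes_blend are made up for this example, and the target attribute is a GCC/Clang extension used here only so that the SSE4.1 variant compiles without extra flags (passing -msse4.1 or -mavx instead should work as well).

```c
#include <immintrin.h>

/* Variant 1 (SSE2 only): explicit ANDNOT / AND / OR.
 * Function name is hypothetical, chosen for this sketch. */
static inline __m128i replace_bytes_and_or(__m128i res, __m128i val, __m128i sub)
{
    /* 0xFF in every byte where res == val, 0x00 elsewhere */
    __m128i mask = _mm_cmpeq_epi8(res, val);
    /* Original bytes, with the matching positions zeroed: (~mask) & res */
    __m128i keep = _mm_andnot_si128(mask, res);
    /* Substitution bytes, only at the matching positions: mask & sub */
    __m128i repl = _mm_and_si128(mask, sub);
    /* Merge the two disjoint parts */
    return _mm_or_si128(keep, repl);
}

/* Variant 2 (SSE4.1): a single byte blend replaces the three logic ops.
 * The attribute is an assumption about the toolchain (GCC/Clang). */
__attribute__((target("sse4.1")))
static inline __m128i replace_bytes_blend(__m128i res, __m128i val, __m128i sub)
{
    __m128i mask = _mm_cmpeq_epi8(res, val);
    /* Takes the byte from sub where the mask byte's top bit is set,
     * otherwise the byte from res */
    return _mm_blendv_epi8(res, sub, mask);
}
```

With the res, val and sub vectors from the question, both functions should return the input with every 0x06 byte replaced by 0x01 and all other bytes unchanged.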
In fact, clang 5.0 and later is smart enough to replace the first variant with the second when compiling with optimization: https://godbolt.org/z/P-tcik
N.B.: If the substitution value is in fact 0x01, you can exploit the fact that each byte of the mask (the result of the comparison) is 0x00 or 0xff (which is -0x01 as a signed byte), i.e., you can zero out the values you want to substitute and then subtract the mask. This can save either loading the 0x01 vector from memory or wasting a register for it. And depending on your architecture, it may have slightly better throughput.
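The subtraction trick for a substitution value of 0x01 might look like the sketch below. The function name replace_bytes_with_one is hypothetical, and the code needs only SSE2.

```c
#include <immintrin.h>

/* Replace every byte equal to val with 0x01, without a 0x01 constant.
 * Function name is hypothetical, chosen for this sketch. */
static inline __m128i replace_bytes_with_one(__m128i res, __m128i val)
{
    /* 0xFF (i.e. -1 as a signed byte) where res == val, 0x00 elsewhere */
    __m128i mask = _mm_cmpeq_epi8(res, val);
    /* Zero out the bytes that are to be substituted */
    __m128i keep = _mm_andnot_si128(mask, res);
    /* Matching bytes: 0 - (-1) = 1; all other bytes: x - 0 = x */
    return _mm_sub_epi8(keep, mask);
}
```

Since the matching bytes are zeroed before the subtraction, no byte can wrap around: each lane ends up as either its original value or exactly 0x01.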