Substitute a byte with another one

Posted 2019-08-28 12:25

I am having difficulty writing code for this seemingly easy problem.

Given a vector of packed 8-bit integers, substitute every occurrence of one byte value with another.

For instance, I want to substitute 0x06 with 0x01, so I can do the following with res as the input to find 0x06:

// Bytes to be manipulated
res = _mm_set_epi8(0x00, 0x03, 0x02, 0x06, 0x0F, 0x02, 0x02, 0x06, 0x0A, 0x03, 0x02, 0x06, 0x00, 0x00, 0x02, 0x06);

// Target value and substitution
val = _mm_set1_epi8(0x06);
sub = _mm_set1_epi8(0x01);

// Find the target
sse = _mm_cmpeq_epi8(res, val);

// Isolate target
sse = _mm_and_si128(res, sse);

// Isolate remaining bytes
adj = _mm_andnot_si128(sse, res);

Now I don't know how to OR those two parts back together: I need to remove the target and put the replacement byte in its place.

What SIMD instruction am I missing here?

As with my other questions, I am limited to AVX; I have no newer processor.

Tags: sse simd avx
1 answer
一纸荒年 Trace。
Reply #2 · 2019-08-28 13:08

You essentially need to zero every byte of the input that you want to replace, zero every other byte of the substitution vector, and OR the two results. You already have the mask for that from _mm_cmpeq_epi8. Overall, with inp as the input (res in your code), it can be done like this:

__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_or_si128(_mm_and_si128(mask, sub), _mm_andnot_si128(mask, inp));
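
As a self-contained sketch of the same idea, the helper below wraps those two lines in a function that works on plain 16-byte arrays. The name replace_bytes_sse2 and the array-based interface are assumptions made for illustration; only the intrinsics come from the answer. It needs nothing beyond SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copy 16 bytes from in to out, replacing every byte
   equal to target with replacement. */
void replace_bytes_sse2(const uint8_t in[16], uint8_t out[16],
                        uint8_t target, uint8_t replacement)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i val  = _mm_set1_epi8((char)target);
    __m128i sub  = _mm_set1_epi8((char)replacement);

    /* 0xFF in every byte where inp == target, 0x00 elsewhere */
    __m128i mask = _mm_cmpeq_epi8(inp, val);

    /* (mask & sub) | (~mask & inp) */
    __m128i res  = _mm_or_si128(_mm_and_si128(mask, sub),
                                _mm_andnot_si128(mask, inp));

    _mm_storeu_si128((__m128i *)out, res);
}
```

Calling replace_bytes_sse2(in, out, 0x06, 0x01) on the bytes from the question should leave every 0x06 replaced by 0x01 and all other bytes unchanged.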

Since the last combination of and/andnot/or is very common, SSE4.1 introduced an instruction which (essentially) combines these into one:

__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_blendv_epi8(inp, sub, mask);
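
Since any CPU with AVX also supports SSE4.1, the blend variant is usable under the constraint in the question. A minimal sketch follows; the function name replace_bytes_sse41 and the array interface are hypothetical, and the target attribute is a GCC/clang extension used here as an assumed alternative to compiling the whole file with -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics (_mm_blendv_epi8) */
#include <stdint.h>

/* Hypothetical helper: same behaviour as the and/andnot/or version, but
   using a single blend. The attribute (GCC/clang only) enables SSE4.1
   for this one function. */
__attribute__((target("sse4.1")))
void replace_bytes_sse41(const uint8_t in[16], uint8_t out[16],
                         uint8_t target, uint8_t replacement)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i val  = _mm_set1_epi8((char)target);
    __m128i sub  = _mm_set1_epi8((char)replacement);
    __m128i mask = _mm_cmpeq_epi8(inp, val);

    /* Per byte: take sub where the mask byte's high bit is set, else inp. */
    _mm_storeu_si128((__m128i *)out, _mm_blendv_epi8(inp, sub, mask));
}
```

Because the comparison mask is either 0x00 or 0xFF per byte, selecting on the high bit alone gives the same result as the three-instruction version.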

In fact, clang 5.0 and later are smart enough to replace the first variant with the second when compiling with optimization: https://godbolt.org/z/P-tcik


N.B.: If the substitution value is in fact 0x01 you can exploit the fact that the mask (the result of the comparison) is 0x00 or 0xff (which is -0x01), i.e., you can zero out the values you want to substitute and then subtract the mask:

__m128i val = _mm_set1_epi8(0x06);
__m128i mask = _mm_cmpeq_epi8(inp, val);
return _mm_sub_epi8(_mm_andnot_si128(mask, inp), mask);

This saves either loading the 0x01 vector from memory or tying up a register for it, and depending on your architecture it may have slightly better throughput.
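
To make the arithmetic concrete, here is a small sketch of the subtraction trick as a standalone function. The name replace_with_one and the array interface are hypothetical. For a matching byte the andnot yields 0x00, and 0x00 - 0xFF wraps around to 0x01; for any other byte the mask is 0x00, so the value passes through unchanged.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: replace every byte equal to target with 0x01.
   Works only for the replacement value 0x01. */
void replace_with_one(const uint8_t in[16], uint8_t out[16], uint8_t target)
{
    __m128i inp  = _mm_loadu_si128((const __m128i *)in);
    __m128i mask = _mm_cmpeq_epi8(inp, _mm_set1_epi8((char)target));

    /* Matching bytes: 0x00 - 0xFF = 0x01 (wraps modulo 256).
       Other bytes:    inp  - 0x00 = inp. */
    __m128i res = _mm_sub_epi8(_mm_andnot_si128(mask, inp), mask);

    _mm_storeu_si128((__m128i *)out, res);
}
```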
