I want to multiply, with SSE4, a __m128i object holding 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?
A (potentially) faster way than Marat's solution based on Agner Fog's solution:
Instead of splitting hi/low, split odd/even. This has the added benefit that it works with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice added bonus for some). I also added an optimization if you have AVX2. Technically the AVX2 optimization works with only SSE2 intrinsics, but it's slower than the shift left then right solution.
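The odd/even split described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's original listing: the function name `mullo_epi8_oddeven` is hypothetical, and the `__AVX2__` check is only an assumed way of selecting the mask-based repack mentioned in the text.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: element-wise multiply of 16 x 8-bit lanes,
   keeping the low 8 bits of each product. */
static inline __m128i mullo_epi8_oddeven(__m128i a, __m128i b)
{
    /* Even bytes: the low byte of each 16-bit product depends only on
       the low bytes of the inputs; the high byte is garbage for now. */
    __m128i even = _mm_mullo_epi16(a, b);

    /* Odd bytes: move them into the low half of each 16-bit lane,
       then multiply the same way. */
    __m128i odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8),
                                  _mm_srli_epi16(b, 8));

#ifdef __AVX2__
    /* Mask-based repack: clear the garbage high byte of the even
       products with a constant mask (uses only SSE2 intrinsics). */
    return _mm_or_si128(_mm_slli_epi16(odd, 8),
                        _mm_and_si128(even, _mm_set1_epi16(0x00FF)));
#else
    /* Shift left then right to clear the high byte without a constant. */
    return _mm_or_si128(_mm_slli_epi16(odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(even, 8), 8));
#endif
}
```

Both repack variants should produce the same result; they differ only in whether a mask constant has to be materialized.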
Agner uses the blendv_epi8 intrinsic with SSE4.1 support.
Edit:
Interestingly, after doing more disassembly work (with optimized builds), at least my two implementations get compiled to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX).
It uses the "AVX2-optimized" version with a pre-compiled 128-bit xmm constant. Compiling with only SSE2 support produces similar results (though using SSE2 instructions). I suspect Agner Fog's original solution might get optimized to the same thing (would be crazy if it didn't). No idea how Marat's original solution compares in an optimized build, though for me having a single method for all x86 SIMD extensions newer than and including SSE2 is quite nice.
The only 8-bit SSE multiply instruction is PMADDUBSW (SSSE3 and later, C/C++ intrinsic: _mm_maddubs_epi16). This multiplies 16 x 8-bit unsigned values by 16 x 8-bit signed values and then sums adjacent pairs (with signed saturation) to give 8 x 16-bit signed results. If you can't use this rather specialised instruction then you'll need to unpack to pairs of 16-bit vectors and use regular 16-bit multiply instructions. Obviously this implies at least a 2x throughput hit, so use the 8-bit multiply if you possibly can.
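To make the semantics of _mm_maddubs_epi16 concrete, here is a small sketch. The wrapper name `dot_pairs_u8_s8` is hypothetical, and the `target("ssse3")` attribute is an assumption about GCC/Clang so that the code builds without passing -mssse3; other compilers may need a different setup.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 */
#include <stdint.h>

/* Hypothetical wrapper. For each 16-bit output lane i it computes
   saturate_int16(u[2i] * s[2i] + u[2i+1] * s[2i+1]),
   where u holds unsigned bytes and s holds signed bytes. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))  /* assumption: GCC/Clang attribute */
#endif
static inline __m128i dot_pairs_u8_s8(__m128i u, __m128i s)
{
    return _mm_maddubs_epi16(u, s);
}
```

Note that the first operand is treated as unsigned and the second as signed, so this is not a drop-in replacement for an element-wise 8-bit multiply; it fits dot-product style code such as filtering with signed coefficients.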
There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows:
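A minimal sketch of this kind of emulation, assuming only SSE2 is available. It is an illustration rather than the original listing from this answer: the name `mullo_epi8_hilo` is hypothetical, and the unpack/pack instructions chosen here are not necessarily the ones the answer used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: widen each half of the inputs to 16-bit lanes,
   multiply, then narrow back to 8 bits (low byte of each product). */
static inline __m128i mullo_epi8_hilo(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();

    /* Zero-extend the low and high 8 bytes of each input to 16 bits. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);

    /* 16-bit multiplies; only the low byte of each product is wanted. */
    __m128i c_lo = _mm_mullo_epi16(a_lo, b_lo);
    __m128i c_hi = _mm_mullo_epi16(a_hi, b_hi);

    /* Mask to 0..255 so the saturating pack cannot clamp anything. */
    const __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(c_lo, mask),
                            _mm_and_si128(c_hi, mask));
}
```

The result wraps modulo 256 per lane, matching what a C `uint8_t` multiply would give.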