I decided to continue optimising FAST corner detection and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
I know this post is quite old, but I thought it would be useful to share my (validated) solution. It assumes that every lane of the input argument is either all ones or all zeroes.
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
As in Michael's answer, the trick is to replace each non-zero entry with a power of two (2^i for lane index i within its 64-bit half) and to sum them pairwise three times. Each addition must widen the element size so that the stride doubles: you reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of each of these two numbers gives one half of the result. I don't think there is an easy way to pack them together to form a single short value using NEON.
This takes 6 NEON instructions, provided the input is already in a suitable form and the powers can be preloaded.
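The reduction described above can be sketched roughly as follows. This is not the poster's original listing, only a minimal illustration of the widening pairwise-add idea. The names kLanePowers, neon_movemask_u8 and movemask_u8x16 are made up for this example, and the scalar branch is an assumption added so the same logic can be checked on a machine without NEON. As stated above, every lane is assumed to be 0x00 or 0xFF; _mm_movemask_epi8 itself only looks at the top bit of each byte, so other inputs would need a comparison or shift first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1
#endif

/* Bit weight of each lane: 2^(lane index mod 8), repeated for both 64-bit halves. */
static const uint8_t kLanePowers[16] = {
    1, 2, 4, 8, 16, 32, 64, 128,
    1, 2, 4, 8, 16, 32, 64, 128
};

#ifdef HAVE_NEON_SKETCH
/* Hypothetical helper: assumes each lane of input is 0x00 or 0xFF. */
static inline uint16_t neon_movemask_u8(uint8x16_t input)
{
    /* Keep one distinct bit per set lane. */
    uint8x16_t bits = vandq_u8(input, vld1q_u8(kLanePowers));
    /* Three widening pairwise adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64. */
    uint16x8_t sum16 = vpaddlq_u8(bits);
    uint32x4_t sum32 = vpaddlq_u16(sum16);
    uint64x2_t sum64 = vpaddlq_u32(sum32);
    /* Each 64-bit lane now holds one byte of the mask; combine them in scalar code. */
    return (uint16_t)(vgetq_lane_u64(sum64, 0) | (vgetq_lane_u64(sum64, 1) << 8));
}
#endif

/* Takes 16 bytes from memory; uses NEON when available, else the same steps in scalar C. */
uint16_t movemask_u8x16(const uint8_t *lanes)
{
    assert(lanes != NULL);
#ifdef HAVE_NEON_SKETCH
    return neon_movemask_u8(vld1q_u8(lanes));
#else
    uint16_t sum16[8];
    uint32_t sum32[4];
    uint64_t sum64[2];
    int i;

    for (i = 0; i < 8; ++i)   /* 16 x 8-bit -> 8 x 16-bit */
        sum16[i] = (uint16_t)((lanes[2 * i] & kLanePowers[2 * i]) +
                              (lanes[2 * i + 1] & kLanePowers[2 * i + 1]));
    for (i = 0; i < 4; ++i)   /* 8 x 16-bit -> 4 x 32-bit */
        sum32[i] = (uint32_t)sum16[2 * i] + sum16[2 * i + 1];
    for (i = 0; i < 2; ++i)   /* 4 x 32-bit -> 2 x 64-bit */
        sum64[i] = (uint64_t)sum32[2 * i] + sum32[2 * i + 1];

    return (uint16_t)(sum64[0] | (sum64[1] << 8));
#endif
}
```

Because each half sums distinct powers of two, the total never exceeds 255, so nothing overflows at any stage. In this sketch the two bytes are combined after moving them to general-purpose registers rather than inside NEON.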
Note that I haven't tested any of this, but something like this might work:
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.
After some tests, it looks like the following code works correctly: