If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all non zero elements?
For example, if xmm value is | r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 | the result should be something like (1, 3, 5, ... , 15). The result should be placed in another _m128i variable or char[16] array.
If it helps, we can assume that register's value is such that all bytes are either 0 or some constant nonzero value (not necessary 1).
I am pretty much wondering if there is an instruction for that or preferably C/C++ intrinsic. In any SSE or AVX set of instructions.
EDIT 1:
It was correctly observed by @zx485 that original question was not clear enough. I was looking for any "consecutive" solution.
The example 0 1 0 1 0 1 0 1...
above should result in either of the following:
- If we assume that indices start from 1, then
0
would be a termination byte and the result might be
002 004 006 008 010 012 014 016 000 000 000 000 000 000 000 000
- If we assume that negative byte is a termination byte the result might be
001 003 005 007 009 011 013 015 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF
- Anything, that gives as a consecutive bytes which we can interpret as indices of non-zero elements in the original value
EDIT 2:
Indeed, as @harold and @Peter Cordes suggest in the comments to the original post, one of the possible solutions is to create a mask first (e.g. with pmovmskb
) and check non zero indices there. But that will lead to a loop.
Your question was unclear regarding the aspect if you want the result array to be "compressed". What I mean by "compressed" is, that the result should be consecutive. So, for example for
0 1 0 1 0 1 0 1...
, there are two possibilities:Non-consecutive:
Consecutive:
One problem of the consecutive approach is: how do you decide if it's index
0
or a termination value?I'm offering a simple solution to the first, non-consecutive approach, which should be quite fast:
Just for the sake of completeness: the second, the consecutive approach, is not covered in this answer.
Updated answer: the new solution is slightly more efficient.
You can do this without a loop by using the pext instruction from the Bit Manipulation Instruction Set 2 , in combination with a few other SSE instructions.
It takes about five instructions to get 0xFF (the termination byte) at the unwanted positions. Note that a function
nonz_index
that returns the indices and only the position of the termination byte, without actually inserting the termination byte(s), would be much cheaper to compute and might be as suitable in a particular application. The position of the first termination byte isnnz64>>2
.The result is:
The
pext
instruction is supported on Intel Haswell processors or newer.