ARM NEON: Store n-th position(s) of non-zero byte(s)

Published 2019-07-26 21:05

Question:

I want to transform a NEON 64-bit vector so that it contains the lane position(s) of its non-zero (i.e. 0xFF) 8-bit value(s), packed at the start, with the rest of the vector filled with zeros. Here are some examples:

    0  1  2  3  4  5  6  7

d0: 00 FF 00 FF 00 00 00 FF
d1: 1  3  7  0  0  0  0  0

d0: 00 FF FF FF 00 00 FF 00
d1: 1  2  3  6  0  0  0  0

d0: FF FF FF FF FF FF FF FF
d1: 0  1  2  3  4  5  6  7

d0: FF 00 00 00 00 00 00 00
d1: 0  0  0  0  0  0  0  0

d0: 00 00 00 00 00 00 00 00
d1: 0  0  0  0  0  0  0  0

I have a feeling it's probably one or two bit-shift NEON instructions combined with another "good" vector. How can I do that?
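In scalar terms, the transform being asked for looks like this (a plain-C reference sketch; the function name is made up for illustration):

```c
#include <stdint.h>

/* Collect the lane indices of all 0xFF bytes at the front of the
 * output, then zero-fill the remaining lanes. */
static void positions_ref(const uint8_t in[8], uint8_t out[8])
{
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (in[i] == 0xFF)
            out[n++] = (uint8_t)i;
    while (n < 8)
        out[n++] = 0;
}
```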

Answer 1:

This turns out to be not at all simple.

The naïve efficient approach starts with trivially getting the indices (just load a static vector of 0 1 2 3 4 5 6 7 and vand it with the bitmask). However, in order to then collect them at one end of the output vector - in different lanes to the input lanes they represent - you then need some arbitrary permutation operation. There's only one instruction capable of arbitrarily permuting vectors, vtbl (or vtbx, which is essentially the same thing). However, vtbl takes a vector of source indices in destination order, which turns out to be the exact same thing you're trying to produce. Thus in order to produce your final result, you need to use your final result, therefore the naïve efficient solution is impossible; QED.
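To make the dead end concrete, here is that "trivial" first step in scalar C (a sketch; the helper name is invented). The surviving indices are still in their original lanes, and an active lane 0 yields the same 0 as an inactive lane, so a data-dependent permutation would still be needed to pack them:

```c
#include <stdint.h>

/* The easy part: AND a constant 0..7 index vector with the 00/FF mask
 * (the NEON vand step).  The result is NOT packed - that is the whole
 * problem the answer is describing. */
static void vand_indices(const uint8_t mask[8], uint8_t out[8])
{
    static const uint8_t idx[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    for (int i = 0; i < 8; i++)
        out[i] = idx[i] & mask[i];
}
```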

The fundamental problem is that what you're effectively doing is sorting a vector, which is inherently not a parallel SIMD operation. NEON is a parallel SIMD instruction set designed for media processing, and is really not cut out for any of the data-dependent/horizontal/scatter-gather operations of more general vector processing.

To prove the point, I did manage to do this in pure NEON, without any scalar code at all, and it's horrible; the best "one or two bit-shift NEON instructions" I could come up with is some conditional-select-based rotating-bitmask accumulation trickery. If it's not clear what it does, I'd suggest stepping through it in a debugger or simulator:

// d0 contains input vector
vmov.u8 d1, #0
vmov.u8 d2, #0
vmvn.u8 d3, #0
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[1]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[2]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[3]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[4]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[5]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[6]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[7]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vbic.u8 d1, d1, d3
// d1 contains output vector
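For readers without a debugger handy, the sequence above can be modelled in portable scalar C (a sketch; each uint8_t[8] stands in for one d register, and the loop plays the role of the unrolling):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the unrolled NEON sequence: d3 starts all-FF and
 * shifts up one lane each time an FF input byte is seen; d1 counts,
 * per lane, how long d3 stayed at FF (vsub of 0xFF == add 1), and the
 * final vbic zeroes the lanes past the last written index. */
static void neon_positions(const uint8_t d0[8], uint8_t d1[8])
{
    uint8_t d3[8], d5[8];
    memset(d1, 0x00, 8);                   /* vmov.u8 d1, #0 */
    memset(d3, 0xFF, 8);                   /* vmvn.u8 d3, #0 */
    for (int k = 0; k < 8; k++) {
        /* vext.u8 d5, d2, d3, #7: d3 shifted up a lane, 0 shifted in */
        d5[0] = 0x00;
        memcpy(d5 + 1, d3, 7);
        /* vbit.u8 d3, d5, d4: take the shifted copy when d0[k] is FF */
        if (d0[k] == 0xFF)
            memcpy(d3, d5, 8);
        for (int i = 0; i < 8; i++) {
            if (k < 7)
                d1[i] -= d3[i];            /* vsub.u8 d1, d1, d3 */
            else
                d1[i] &= (uint8_t)~d3[i];  /* vbic.u8 d1, d1, d3 */
        }
    }
}
```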

Cheating and using a loop (which requires rotating d0 in the opposite direction so that each original lane can be read through d0[0]) makes it smaller, but not really any less awful:

vmov.u8 d1, #0
vmov.u8 d2, #0
vmvn.u8 d3, #0
mov r0, #8
1:
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
subs r0, r0, #1
vext.u8 d0, d0, d0, #1
vsub.u8 d1, d1, d3
bne 1b
vbic.u8 d1, d1, d3

Ideally, if it's at all possible to rework other parts of the algorithm to avoid needing non-constant permutations of vectors, do that instead.



Answer 2:

You can do this by sorting the vector. That's more complex machinery than you'd expect for a task like this, but I haven't come up with anything better.

Given a list of 00/ff bytes in d0, and the constants 0, 1, 2, ..., 7 in d1, you can make a sortable list of active columns using vorn.

vorn.u8 d0, d1, d0

Now all the unwanted lanes of d0 have been replaced with 0xff and the rest have been replaced with their lane index. From there you can sort that list to cluster all the unwanted lanes at the end.
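The vorn step in scalar terms (a sketch; the helper name is invented): out = index | ~mask, so FF lanes keep their index because 0xFF inverts to 0, and 00 lanes become 0xFF, which will sort after every valid index:

```c
#include <stdint.h>

/* vorn.u8 d0, d1, d0 with d1 = {0,1,2,3,4,5,6,7}, modelled per lane. */
static void vorn_indices(const uint8_t mask[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)i | (uint8_t)~mask[i];
}
```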

To do that, you have to extend the list to 16 bytes:

vmov.u8 d1, #255

And then split them into odd/even vectors:

vuzp.u8 d0, d1
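What vuzp does, modelled in scalar C (a sketch): it de-interleaves the 16-byte sequence d0:d1 in place, leaving the even-indexed bytes in d0 and the odd-indexed bytes in d1, which sets up the even/odd lane comparisons of the sort:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of vuzp.u8 d0, d1 (in-place de-interleave). */
static void vuzp_u8(uint8_t d0[8], uint8_t d1[8])
{
    uint8_t all[16];
    memcpy(all, d0, 8);
    memcpy(all + 8, d1, 8);
    for (int i = 0; i < 8; i++) {
        d0[i] = all[2 * i];
        d1[i] = all[2 * i + 1];
    }
}
```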

One sort pass consists of a vmin/vmax pair between those two vectors, then a stagger operation, then another vmin/vmax pair comparing different pairings, so that values can bubble towards their proper places. Like so:

vmin.u8 d2, d0, d1
vmax.u8 d3, d0, d1
vsri.u64 d2, d2, #8   ; stagger the even lanes (discards d0[0])
vmax.u8 d4, d2, d3    ; dst reg would be d0, but we want d0[0] back...
vmin.u8 d1, d2, d3
vsli.u64 d0, d4, #8   ; revert the stagger, and restore d0[0]

This implements two stages of the full network, and the whole block has to be repeated four times (eight stages) to make it possible for something in d0[7] to bubble all the way down to d0[0] in the extreme case of the last byte being the only nonzero input, or for d0[0] to get to d0[7] if the first byte was the only zero input.
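The stagger steps hinge on vsri/vsli preserving part of the destination register. They can be modelled on a uint64_t (assuming a little-endian byte view of a d register, where lane 0 is the least significant byte):

```c
#include <stdint.h>

/* vsri.u64 d, s, #8: source shifted right 8 bits, inserted into the
 * destination's low 56 bits; the destination's top byte survives.
 * In byte-lane terms: lane i takes s's lane i+1, top lane keeps d's. */
static uint64_t vsri_u64_8(uint64_t d, uint64_t s)
{
    return (d & 0xFF00000000000000ull) | (s >> 8);
}

/* vsli.u64 d, s, #8: the mirror image - source shifted left 8 bits,
 * the destination's bottom byte survives (how d0[0] gets restored). */
static uint64_t vsli_u64_8(uint64_t d, uint64_t s)
{
    return (d & 0xFFull) | (s << 8);
}
```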

Once you're done sorting them, stitch the result back together:

vzip.u8 d0, d1

And because you wanted zeroes in the leftover lanes (0xff is -1 as a signed byte, so a signed max against zero clears them):

vmov.u8 d1, #0
vmax.s8 d0, d1

And now d0 should contain the result.
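End to end, the pipeline computes the following, sketched here in scalar C with qsort standing in for the whole sorting network (the helper name is invented):

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u8(const void *a, const void *b)
{
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

/* vorn to build sortable keys, sort ascending so the 0xFF filler
 * clusters at the end, then a signed max against zero to clear it. */
static void sort_positions(const uint8_t mask[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)i | (uint8_t)~mask[i];   /* vorn.u8  */
    qsort(out, 8, 1, cmp_u8);                      /* the network */
    for (int i = 0; i < 8; i++)                    /* vmax.s8 #0  */
        out[i] = ((int8_t)out[i] < 0) ? 0 : out[i];
}
```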

If you check out Wikipedia's sorting network page you'll see that the theoretical minimum depth for eight lanes is only six stages (six pairs of vmin/vmax), so it may be possible to find a set of permutations (replacing my vsli and vsri operations) which implements a six-stage sorting network rather than the eight-stage insertion/selection/bubble sort I've implemented. If such a network exists and is compatible with NEON's permute operations, it'd certainly be worth finding, but I don't have time to look.

Also note that the sort is working on a total of 16 bytes, which is more than you need, and if you used q registers you could have it work on 32 bytes... so this is a long way from maximum throughput.

Oh, and even in this configuration I think you don't need the last stage of the sort. That's left as an exercise for the reader.