I have the following function I'm trying to write an AXV version for:
void
hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length)
{
size_t i, j, v, p;
char temp;
if (!salt_length) {
return;
}
for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) {
v %= salt_length;
p += salt[v];
j = (salt[v] + v + p) % i;
temp = str[i];
str[i] = str[j];
str[j] = temp;
}
}
I'm trying to vectorize v %= salt_length;
.
I want to initialize a vector that contains numbers from 0 to str_length in order to use SVML's _mm_rem_epu64 in order to calculate v for each loop iteration.
How do I initialize the vector correctly?
Asking just how to initialize a vector is basically asking for a tutorial. Google up some of Intel's guides to using intrinsics. I'm going to pretend the question wasn't trivial, and answer about trying to implement this function efficiently. It's definitely not the kind of function you'd try to vectorize as a total beginner.
See the x86 tag wiki for links to docs, esp. Intel's intrinsics guide. See the sse tag wiki for a link to a very nice intro to SIMD programming with SSE intrinsics, and how to use SIMD effectively, among other links.
Contents summary:
v % salt_length
for free.v++; v %= loop_invariant;
if it wasn't a power of 2 or compile-time constant. Includes an answer to the title question about using_mm_set_epi8
or other ways of initializing a vector for this purpose.an untested version of the complete function that vectorizes everything except the
% i
and the swap. (i.e. vectorizes all the operations that were cheap anyway, like you asked).(v + salt[v] + p)
(and everything leading up to it) vectorizes to twovpaddw
instructions. The prefix-sum setup outside the loop for vectorizingp
was tricky, but I eventually vectorized it, too.The vast majority of the function's run time will be in the scalar inner loop over a vector of
j
elements, bottlenecking ondiv
(or whatever SVML can do), and/or cache misses with very large strings.The entire loop can't easily vectorize because the swaps with pseudo-random indices create an unpredictable serial dependency. Using AVX512 gather + shuffle + scatter, with AVX512CD to find conflict bitmasks, might be possible, but that would have to be a separate question. I'm not sure how hard it would be to do this efficiently, or if you'd often end up repeating a vector shuffle many times, only making progress in one non-conflicting element.
Since
salt_length = sizeof(size_t)
is a compile-time constant and a power of 2 smaller than your vector length,v++
andv%=salt_length
don't require any code inside the loop at all, and happens for free as a side-effect of effectively unrolling the loop to do multiplev
values in parallel.(Using a platform-dependent salt size means that a 32-bit build won't be able to process data created with 64-bit salt. Even the x32 ABI has 32-bit
size_t
, so changing to uint64_t would seem to make sense, unless you never need to share salted hashes between machines.)In the scalar loop,
v
follows a repeating pattern of 0 1 2 3 0 1 2 3 ... (or 0..7, for 64-bit). In vector code, we're doing maybe 8v
values at once with 4B elements in a 32-byte vector, or 16 iterations at once with 2B elements.So
v
becomes a loop-invariant constant vector. Interestingly, so doessalt[v]
, so we never need to do any salt table lookups inside the loop. In fact,v+salt[v]
can be pre-computed for scalar and vector.The scalar version should pre-compute
v+salt[v]
and unroll by 4 or 8, too, removing the LUT lookup so all the memory/cache throughput is available for the actual swaps. The compiler probably won't do this for you, so you probably need to unroll manually and write extra code to handle the last odd number of string bytes. (Without unrolling, you could still pre-compute a lookup table ofv+salt[v]
, with a type wide enough not to wrap around).Even just making sure
salt_length
is known at compile time would also allow much better code.v %= compile_time_constant
is cheaper than adiv
insn, and extremely cheap when it's a power of 2. (It just turns intov &= 7
). The compiler might possibly do this for you if the scalar version can inline, or if you usedsalt_length = sizeof(size_t)
instead of passing it as a function arg at all.If you didn't already know salt_length: i.e. what @harold was suggesting before you revealed the critical information about salt_length:
Since we know
v < salt_length
to start with, we only ever need at most onev -= salt_length
to wrap it back into the right range and maintain that invariant. This is called a "strength reduction" operation, because subtraction is a weaker (and cheaper) operation than division.To vectorize just this: let's pretend that all we know is
salt_length <= 16
, so we can use a vector of 32 uint8_t values. (And we can use pshufb to vectorize thesalt[v]
LUT lookup).Of course, if we knew salt_length was a power of 2, we should have just masked off all but the relevant low bits in each element:
Noticing that we started with the wrong element size is when we realize that only vectorizing one line at a time was a bad idea, because now we have to go change all the code we already wrote to use wider elements, sometimes requiring a different strategy to do the same thing.
Of course, you do need to start with a rough outline, mental or written down, for how you might do each step separately. It's in the process of thinking this through that you see how the different parts might fit together.
For complex loops, a useful first step might be trying to manually unroll the scalar loop. That will help find serial dependencies, and things that simplify with unrolling.
(stuff) % i
: The hard partWe need elements wide enough to hold the maximum value of
i
, becausei
is not a power of 2, and not constant, so that modulo operation takes work. Any wider is a waste and cuts our throughput. If we could vectorize the whole rest of the loop, it might even be worth specializing the function with different versions for different ranges ofstr_length
. (Or maybe looping with 64b elements until i<= UINT32_MAX, then loop until i<=UINT16_MAX, etc). If you know you don't need to handle strings > 4GiB, you can speed up the common case by only using 32-bit math. (64-bit division is slower than 32-bit division, even when the upper bits are all zero).Actually, I think we need elements as wide as the maximum
p
, since it keeps accumulating forever (until it wraps at 2^64 in the original scalar code). Unlike with a constant modulus, we can't just usep%=i
to keep it in check, even though modulo is distributive.(123 % 33) % (33-16) != 123 % (33-16)
. Even aligning to 16 doesn't help: 12345 % 32 != 12345 % 48 % 32This will quickly make
p
too large for repeated conditional subtraction ofi
(until the condition mask is all false), even for fairly largei
values.There are tricks for modulo by known integer constants (see http://libdivide.com/), but AFAIK working out the multiplicative modular inverse for a nearby divisor (even with a power-of-two stride like 16) isn't any easier than for a totally separate number. So we couldn't cheaply just adjust the constants for the next vector of
i
values.The law of small numbers perhaps makes it worth-while to peel off the last couple vector iterations, with pre-computed vectors of multiplicative modular inverses so the
% i
can be done with vectors. Once we're close to the end of the string, it's probably hot in L1 cache so we're totally bottlenecked ondiv
, not the swap loads/stores. For this, we'd maybe use a prologue to reach ani
value that was a multiple of 16, so the last few vectors as we approach i=0 always have the same alignment ofi
values. Or else we'd have a LUT of constants for a range of i values, and simply do unaligned loads from it. That means we don't have to rotatesalt_v
andp
.Possibly converting to FP would be useful, because recent Intel CPUs (especially Skylake) have very powerful FP division hardware with significant pipelining (throughput : latency ratio). If we can get exact results with the right choice of rounding, this would be great. (
float
anddouble
can exactly represent any integer up to about the size of their mantissa.)I guess it's worth trying Intel's
_mm_rem_epu16
(with a vector ofi
values that you decrement with a vector ofset1(16)
). If they use FP to get accurate results, that's great. If it just unpacks to scalar and does integer division, it would waste time getting values back in a vector.Anyway, certainly the easiest solution is to iterate of vector elements with a scalar loop. Until you come up with something extremely fancy using AVX512CD for the swaps, it seems reasonable, but it's probably about an order of magnitude slower than just the swaps would be, if they're all hot in L1 cache.
(untest) partially-vectorized version of the function:
Here's the code on the Godbolt compiler explorer, with full design-notes comments, including the diagrams I made while figuring out the SIMD prefix-sum algo. I eventually remembered I'd seen a narrower version of this as a building block in @ZBoson's floating point SSE Prefix sum answer, but not until after mostly re-inventing it myself.
This compiles to asm that looks about right, but I haven't run it.
Clang makes a fairly sensible inner loop. This is with
-fno-unroll-loops
for readability. Leave that out for performance, although it won't matter here since loop overhead isn't the bottleneck.[disclaimer: considering 32 bits application - in which size_t is unsigned int]
Aligned allocation can be done by using aligned_malloc functions, which you can find both for windows and linux.
Allocating your string this way (to a 64 byte boundary) will allow you to load data directly into _mm256i registers by using aligned loads _mm256_load_si256 for all bytes.
If the string is not properly aligned, you can load the first bytes using unaligned loads _mm256_loadu_si256 for bytes at the beginning.
The first modulo operation you perform (v %= salt_length) is done with a constant operand, which you can initialize before the loop in an avx register by using _mm256_set1_epi32:
for the next one, you can use _mm256_set_epi32 which initializes a register with all values provided (beware of reverse order).
Please note that if you're using AVX2 or AVX512 (and not just AVX -- your question is a bit confusing about the instruction set), you can also load data using gather instructions which load data from a vector at indexes specified in a second argument.
The smallest numbers you can use with AVX512 in remainder operations are 8 bit, the intrinsic is:
_mm512_rem_epu8
However, to initialize its input with values from 0 to N you would need to use
_mm512_set_epi32
and pass it 8-bit integers packed into 32-bit integers, because there doesn't seem to be an intrinsic taking 64 integers 8-bit each. The code would look like:Consider writing a code generator for this, if you don't like such typing or afraid of typos.
Usage of
__m512i
type should take care of the alignment for you automatically, if you don't allocate it withmalloc()
. Otherwise, search "aligned malloc" for your compiler. The alignment you need is 64 bytes (equal to 512 bits).When you need the next
N
integers in the vector, you can do:And then you can add
inc
andsseConst
( intrinsic_mm512_add_epi32
).