I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers.
The function is
void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for(int i = 0; i < dim; i++) {
        a[i] = (a[i]+b[i])%q; // LINE1
    }
}
I'm compiling with icc -std=gnu99 -O3 (icc so I can use SVML later) on an Ivy Bridge (SSE4.2 and AVX, but not AVX2).
My baseline is removing the %q from LINE1. 100 (iterated) function calls with dim=11221184 take 1.6 seconds. ICC auto-vectorizes the code for SSE; great.
I really want to do modular additions, though. With the %q in place, ICC does not auto-vectorize the code, and it runs in 11.8 seconds(!). Even setting aside the lost auto-vectorization of the previous attempt, this seems excessive.
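For a sense of scale: if the inputs were already reduced to [0, q), each sum would be below 2q and a single compare-and-subtract would replace the division entirely, which is part of why a full 64-bit remainder per element feels heavy. A scalar sketch of that idea (addRq_condsub is my name; it only works under the reduced-inputs assumption):

```c
#include <stdint.h>

/* Assumes 0 <= a[i], b[i] < q, so a[i] + b[i] < 2q and one
   compare-and-subtract replaces the division. */
void addRq_condsub(int64_t* a, const int64_t* b, int32_t dim, int64_t q) {
    for (int i = 0; i < dim; i++) {
        int64_t s = a[i] + b[i];
        a[i] = (s >= q) ? s - q : s;
    }
}
```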
Since I don't have AVX2, vectorization with SSE requires SVML, which is perhaps why ICC didn't auto-vectorize. At any rate, here's my attempt to vectorize the inner loop:
__m128i qs = _mm_set1_epi64x(q);
for(int i = 0; i < dim; i+=2) {
    __m128i xs = _mm_load_si128((const __m128i*)(a+i));
    __m128i ys = _mm_load_si128((const __m128i*)(b+i));
    __m128i zs = _mm_add_epi64(xs,ys);
    zs = _mm_rem_epi64(zs,qs); // SVML vector remainder
    _mm_store_si128((__m128i*)(a+i),zs);
}
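(This loop assumes dim is even and that a and b are 16-byte aligned, which holds in my case; in general a scalar tail would finish off an odd element. addRq_tail below is a hypothetical sketch of that cleanup, not part of my actual code:)

```c
#include <stdint.h>

/* Scalar tail: reduces the final element when dim is odd, i.e. the one
   element a 2-wide SSE loop would leave unprocessed. */
void addRq_tail(int64_t* a, const int64_t* b, int32_t dim, int64_t q) {
    for (int i = dim & ~1; i < dim; i++)  /* start at the last even index */
        a[i] = (a[i] + b[i]) % q;
}
```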
Assembly for the main loop is:
..B3.4: # Preds ..B3.2 ..B3.12
movdqa (%r12,%r15,8), %xmm0 #59.22
movdqa %xmm8, %xmm1 #60.14
paddq (%r14,%r15,8), %xmm0 #59.22
call __svml_i64rem2 #61.9
movdqa %xmm0, (%r12,%r15,8) #61.36
addq $2, %r15 #56.30
cmpq %r13, %r15 #56.24
jl ..B3.4 # Prob 82% #56.24
So the code is getting vectorized as expected. I know I might not get a 2x speedup due to SVML, but the code runs in 12.5 seconds, slower than with no vectorization at all! Is this really the best that can be done here?