I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers.
The function is
void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for(int i = 0; i < dim; i++) {
        a[i] = (a[i]+b[i])%q; // LINE1
    }
}
I'm compiling with icc -std=gnu99 -O3 (icc so I can use SVML later) on an Ivy Bridge (SSE4.2 and AVX, but not AVX2).
My baseline is removing the %q from LINE1. 100 (iterated) function calls with dim=11221184 take 1.6 seconds. ICC auto-vectorizes the code for SSE; great.
I really want to do modular additions, though. With the %q in place, ICC does not auto-vectorize the code, and it runs in 11.8 seconds(!). Even setting aside the lost auto-vectorization of the previous attempt, this seems excessive.
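For a sense of scale: if the inputs were already reduced to [0, q), each sum would be below 2q and a single compare-and-subtract would replace the division entirely, which is part of why a full 64-bit remainder per element feels heavy. A scalar sketch of that idea (addRq_condsub is my name; it only works under the reduced-inputs assumption):

```c
#include <stdint.h>

/* Assumes 0 <= a[i], b[i] < q, so a[i] + b[i] < 2q and one
   compare-and-subtract replaces the division. */
void addRq_condsub(int64_t* a, const int64_t* b, int32_t dim, int64_t q) {
    for (int i = 0; i < dim; i++) {
        int64_t s = a[i] + b[i];
        a[i] = (s >= q) ? s - q : s;
    }
}
```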
Since I don't have AVX2, vectorization with SSE requires SVML, which is perhaps why ICC didn't auto-vectorize. At any rate, here's my attempt to vectorize the inner loop:
__m128i qs = _mm_set1_epi64x(q);
for(int i = 0; i < dim; i+=2) {
    __m128i xs = _mm_load_si128((const __m128i*)(a+i));
    __m128i ys = _mm_load_si128((const __m128i*)(b+i));
    __m128i zs = _mm_add_epi64(xs,ys);
    zs = _mm_rem_epi64(zs,qs); // SVML vector remainder
    _mm_store_si128((__m128i*)(a+i),zs);
}
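(This loop assumes dim is even and that a and b are 16-byte aligned, which holds in my case; in general a scalar tail would finish off an odd element. addRq_tail below is a hypothetical sketch of that cleanup, not part of my actual code:)

```c
#include <stdint.h>

/* Scalar tail: reduces the final element when dim is odd, i.e. the one
   element a 2-wide SSE loop would leave unprocessed. */
void addRq_tail(int64_t* a, const int64_t* b, int32_t dim, int64_t q) {
    for (int i = dim & ~1; i < dim; i++)  /* start at the last even index */
        a[i] = (a[i] + b[i]) % q;
}
```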
Assembly for the main loop is:
..B3.4: # Preds ..B3.2 ..B3.12
movdqa (%r12,%r15,8), %xmm0 #59.22
movdqa %xmm8, %xmm1 #60.14
paddq (%r14,%r15,8), %xmm0 #59.22
call __svml_i64rem2 #61.9
movdqa %xmm0, (%r12,%r15,8) #61.36
addq $2, %r15 #56.30
cmpq %r13, %r15 #56.24
jl ..B3.4 # Prob 82% #56.24
So the code is getting vectorized as expected. I know I might not get a 2x speedup due to SVML, but the code runs in 12.5 seconds, slower than with no vectorization at all! Is this really the best that can be done here?