Arstechnia recently had an article Why are some programming languages faster than others. It compares Fortran and C and mentions summing arrays. In Fortran it's assumed that arrays don't overlap so that allows further optimization. In C/C++ pointers to the same type may overlap so this optimization can't be used in general. However, in C/C++ one can use the restrict
or __restrict
keyword to tell the compiler not to assume the pointers overlap. So I started looking into this in regards to auto-vectorization.
The following code vectorizes in GCC and MSVC
void dot_int(int *a, int *b, int *c, int n) {
for(int i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
I tested this with and without overlapping arrays and it gets the correct result. However, the way I would vectorize this loop manually with SSE does not handle overlapping arrays.
int i=0;
for(; i<n-3; i+=4) {
__m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
__m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
__m128i c4 = _mm_add_epi32(a4,b4);
_mm_storeu_si128((__m128i*)c, c4);
}
for(; i<n; i++) {
c[i] = a[i] + b[i];
}
Next I tried using __restrict
. I assumed that since the compiler could assume the arrays don't overlap it would not handle overlapping arrays but both GCC and MSVC still get the correct result for overlapping arrays even with __restrict
.
void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
for(int i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
Why does the auto-vectorized code with and without __restrict
get the correct result for overlapping arrays?.
Here is the full code I used to test this:
#include <stdio.h>
#include <immintrin.h>
void dot_int(int *a, int *b, int *c, int n) {
for(int i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
}
void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
for(int i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
}
void dot_int_SSE(int *a, int *b, int *c, int n) {
int i=0;
for(; i<n-3; i+=4) {
__m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
__m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
__m128i c4 = _mm_add_epi32(a4,b4);
_mm_storeu_si128((__m128i*)c, c4);
}
for(; i<n; i++) {
c[i] = a[i] + b[i];
}
for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
}
int main() {
const int n = 100;
int a[] = {1,1,1,1,1,1,1,1};
int b1[] = {1,1,1,1,1,1,1,1,1};
int b2[] = {1,1,1,1,1,1,1,1,1};
int b3[] = {1,1,1,1,1,1,1,1,1};
int c[8];
int *c1 = &b1[1];
int *c2 = &b2[1];
int *c3 = &b3[1];
dot_int(a,b1,c, 8);
dot_int_SSE(a,b1,c,8);
dot_int(a,b1,c1, 8);
dot_int_restrict(a,b2,c2,8);
dot_int_SSE(a,b3,c3,8);
}
The output (from MSVC)
2 2 2 2 2 2 2 2 //no overlap default
2 2 2 2 2 2 2 2 //no overlap with manual SSE vector code
2 3 4 5 6 7 8 9 //overlap default
2 3 4 5 6 7 8 9 //overlap with restrict
3 2 2 2 1 1 1 1 //manual SSE vector code
Edit:
Here is another inserting version which produces much simpler code
void dot_int(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
a = (int*)__builtin_assume_aligned (a, 16);
b = (int*)__builtin_assume_aligned (b, 16);
c = (int*)__builtin_assume_aligned (c, 16);
for(int i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
}
If you use "restrict" with overlapping arrays, you will get undefined behaviour. That's what you get in the "overlap restrict" case. Undefined behaviour means anything can happen. And it did. Just by coincidence, the behaviour was the same as without "restrict". Absolutely correct. It falls straight under the definition of "anything can happen". Nothing to complain about.
I think I understand what is going on now. It turns out that MSVC and GCC give different results with
__restrict
. MSVC gets the correct answer with the overlap and GCC does not. I think it's fair to conclude then that MSVC is ignoring the__restrict
keyword and GCC is using it to optimize further.The output from GCC
We can produce a purely vectorized function which gives almost as many assembly lines as C doing (all the code is generated with
-O3
in GCC 4.9.0) this:Produces
However, if we remove the
__restrict on a
which allows a to overlap with c I have determined by looking at the assembly thatIs identical to
In other words if the a and c pointers are within 16 bytes of each other (|a-c|<4) then the code branches to a non-vectorized form. This confirms my guess that the auto-vectorized code includes both a vectorized and non-vectorized version to handle overlap.
On the overlapping arrays this gets the correct result: 2 3 4 5 6 7 8 9
I don't get what the problem is. Testing on Linux/64 bit, GCC 4.6, -O3, -mtune=native, -msse4.1 (i.e. a very old compiler/system), this code
compiles to this inner loop:
While this code
compiles to this:
As you can clearly see there's one less load. I guess it correclty estimated that there's no need to explicitely load memory before performing the sum, as the result won't overwrite anythng.
There's also room for way more optimizations -- GCC doesn't know that the parameters are f.i. 128 bit aligned, hence it must generate a huge preamble to check that there are no alignment issues (YMMV), and a postable to deal with extra unaligned parts (or less wide than 128 bits). This actually happens with both versions above. This is the complete code generated for
dot_int
:Now in your case the ints effectively not aligned (as they're on the stack), but if you can make them aligned and tell GCC so, then you can improve code generation:
This generates this code, with no preamble:
Note the usage of the aligned 128 bit load instructions (
movdqa
, a as aligned, vsmovdqu
, unaligned).