For the following function, the optimized code is vectorized and the computation is performed in registers (the result is returned in eax). The generated machine code can be seen, e.g., here: https://godbolt.org/z/VQEBV4.
int sum(int *arr, int n) {
    int ret = 0;
    for (int i = 0; i < n; i++)
        ret += arr[i];
    return ret;
}
However, if I make ret a global variable (or a parameter of type int&), vectorization is not used and the compiler stores the updated ret to memory in each iteration. Machine code: https://godbolt.org/z/NAmX4t.
int ret = 0;
int sum(int *arr, int n) {
    for (int i = 0; i < n; i++)
        ret += arr[i];
    return ret;
}
I don't understand why these optimizations (vectorization and keeping the computation in registers) are prevented in the latter case. There is no threading involved, and the increments are not even performed atomically. Moreover, this behavior seems to be consistent across compilers (GCC, Clang, Intel), so I believe there must be some reason for it.
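
For completeness, the int& variant mentioned above could look roughly like the sketch below. The function name sum_ref and the parameter order are my own choices for illustration and are not taken from the linked Godbolt snippets; the loop body is the same as in the global version.

```cpp
// Hypothetical variant: the accumulator is passed in by reference
// instead of being a global variable. The name sum_ref is illustrative.
int sum_ref(int *arr, int n, int &ret) {
    for (int i = 0; i < n; i++)
        ret += arr[i];   // accumulates directly into the caller's variable
    return ret;
}
```

A caller would pass its own accumulator, e.g. `int r = 0; sum_ref(arr, n, r);`, after which r holds the sum.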