My laptop CPU supports AVX (Advanced Vector Extensions) but not AVX2. With AVX, the 128-bit xmm* registers are already extended to the 256-bit ymm* registers for floating-point arithmetic. However, in my tests no version of Visual Studio from 2010 to 2015 uses ymm* registers under /arch:AVX, although they do use them under /arch:AVX2.
The following shows the disassembly for a simple for loop. The program is compiled with /arch:AVX in release build, with all optimization options on.
float a[10000], b[10000], c[10000];
for (int x = 0; x < 10000; x++)
1000988F xor eax,eax
10009891 mov dword ptr [ebp-9C8Ch],ecx
c[x] = (a[x] + b[x])*b[x];
10009897 vmovups xmm1,xmmword ptr c[eax]
100098A0 vaddps xmm0,xmm1,xmmword ptr c[eax]
100098A9 vmulps xmm0,xmm0,xmm1
100098AD vmovups xmmword ptr c[eax],xmm0
100098B6 vmovups xmm1,xmmword ptr [ebp+eax-9C78h]
100098BF vaddps xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]
100098C8 vmulps xmm0,xmm0,xmm1
100098CC vmovups xmmword ptr [ebp+eax-9C78h],xmm0
100098D5 add eax,20h
100098D8 cmp eax,9C40h
100098DD jl ComputeTempo+67h (10009897h)
const int winpts = (int)(window_size*sr+0.5);
100098DF vxorps xmm1,xmm1,xmm1
100098E3 vcvtsi2ss xmm1,xmm1,ecx
I have also verified that I can use ymm* registers to speed up my program further without crashing. I did that with the intrinsics from immintrin.h, e.g. _mm256_mul_ps.
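For reference, here is a minimal sketch of what such a hand-vectorized version of the loop above might look like. The function names mul_add_scalar and mul_add_avx and the TARGET_AVX macro are made up for illustration and are not from my actual program. Unaligned loads and stores are assumed, so the arrays need no special alignment, and the tail loop handles lengths that are not a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_M_IX86) || defined(_M_X64) || defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define HAVE_X86_INTRINSICS 1
#endif

// GCC/Clang need the function marked as an AVX target unless -mavx is given.
// MSVC accepts AVX intrinsics without any attribute, so the macro is empty there.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Scalar reference: c[x] = (a[x] + b[x]) * b[x]
void mul_add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = (a[i] + b[i]) * b[i];
    }
}

#ifdef HAVE_X86_INTRINSICS
// Same computation, 8 floats per iteration in 256-bit ymm registers.
// The caller must make sure the CPU and OS support AVX before calling this.
TARGET_AVX
void mul_add_avx(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);  // unaligned 256-bit load
        const __m256 vb = _mm256_loadu_ps(b + i);
        const __m256 vc = _mm256_mul_ps(_mm256_add_ps(va, vb), vb);
        _mm256_storeu_ps(c + i, vc);               // unaligned 256-bit store
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        c[i] = (a[i] + b[i]) * b[i];
    }
}
#endif
```

Calling mul_add_avx(a, b, c, 10000) on the arrays from the question should give the same results as the scalar loop, with each vaddps/vmulps working on a ymm register instead of an xmm register.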
Can any Microsoft compiler developer give an explanation? Or is this maybe one of the reasons why Visual Studio produces slower code than the gcc/g++ compiler?
=============edited==============
The reason turns out to be a difference between running a 32-bit OS on a 32-bit machine and running a 32-bit OS on a 64-bit machine. In the latter case, some OSes might not know that the ymm* registers exist and so do not preserve their upper halves properly during a context switch. If ymm* registers are then used under a 32-bit OS on a 64-bit machine, the upper halves might get silently corrupted on a context switch when another program is also using ymm* registers. Visual Studio is being conservative in this respect.
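Whether the OS actually saves the ymm* state can be checked at run time instead of guessed. The sketch below follows the usual detection sequence as I understand it: CPUID leaf 1 reports OSXSAVE (ECX bit 27) and AVX (ECX bit 28), and XGETBV on XCR0 reports whether the OS has enabled saving of both XMM state (bit 1) and YMM state (bit 2). The function name os_supports_avx and the IS_X86 macro are my own, and the GCC/Clang branch uses inline assembly for XGETBV, which is an assumption about the toolchain.

```cpp
#if defined(_M_IX86) || defined(_M_X64) || defined(__i386__) || defined(__x86_64__)
#define IS_X86 1
#endif

#if defined(IS_X86) && defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#elif defined(IS_X86)
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

// Returns true only if the CPU reports AVX and the OS has enabled
// saving/restoring of both XMM and YMM state across context switches.
bool os_supports_avx() {
#if defined(IS_X86) && defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (regs[2] & (1 << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx) return false;
    const unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;                       // XMM and YMM state enabled
#elif defined(IS_X86)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    const bool osxsave = (ecx & (1u << 27)) != 0;
    const bool avx     = (ecx & (1u << 28)) != 0;
    if (!osxsave || !avx) return false;
    unsigned lo = 0, hi = 0;
    // XGETBV with ECX = 0 reads XCR0 into EDX:EAX.
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0u));
    (void)hi;
    return (lo & 0x6u) == 0x6u;                       // XMM and YMM state enabled
#else
    return false;  // not an x86 target
#endif
}
```

If this returns true inside a 32-bit process, then code using ymm* registers should be safe from the corruption described above, because the OS has declared that it preserves the full registers.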
Yes, it was a 32-bit/64-bit problem. Compiling in x64 mode does not have the problem. However, my program has to be compiled in 32-bit mode because it is a plugin of some sort where only 32-bit is supported. Nonetheless, it is still contradictory that even in 32-bit mode, setting /arch:AVX2 allows the compiler to use ymm* registers.
The Intel specification at http://www.felixcloutier.com/x86/ADDPS.html says that "in 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15)." Also, http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html states that 32-bit programs can access ymm* registers on both 32-bit and 64-bit operating systems. The only restriction is that in 32-bit mode you don't have access to xmm8-xmm15 or ymm8-ymm15, because the instructions are shorter. That is why I am able to use intrinsic functions manually to access the ymm* registers without causing an illegal instruction crash.
So in conclusion, unless some CPUs that support AVX but not AVX2 run into problems accessing ymm* registers in 32-bit mode (which has already been shown not to be the case), the above-mentioned restriction is unnecessary. I still hope the Visual C++ compiler can be improved to make this optimization available, since many computers support AVX but not AVX2, and using ymm* registers can up to double the throughput of floating-point arithmetic.
I made a text file
vec.cpp
went to the command line with Visual Studio 2015 x86 x64 enabled and did
looked at the file
vec.asm
and I see
The problem is that you are compiling in 32-bit mode. Using the same function above but compiling in 32-bit mode I get