Visual Studio 2010 - 2015 does not use ymm* registers

Posted 2019-07-11 13:55

My laptop CPU supports AVX (Advanced Vector Extensions) but not AVX2. With AVX, the 128-bit xmm* registers are extended to 256-bit ymm* registers for floating-point arithmetic. However, in my tests no version of Visual Studio from 2010 to 2015 uses ymm* registers under /arch:AVX, although the compiler does use them under /arch:AVX2.

The following is the disassembly of a simple for loop. The program was compiled with /arch:AVX in a release build, with all optimization options on.

    float a[10000], b[10000], c[10000];
    for (int x = 0; x < 10000; x++)
1000988F  xor         eax,eax  
10009891  mov         dword ptr [ebp-9C8Ch],ecx  
        c[x] = (a[x] + b[x])*b[x];
10009897  vmovups     xmm1,xmmword ptr c[eax]  
100098A0  vaddps      xmm0,xmm1,xmmword ptr c[eax]  
100098A9  vmulps      xmm0,xmm0,xmm1  
100098AD  vmovups     xmmword ptr c[eax],xmm0  
100098B6  vmovups     xmm1,xmmword ptr [ebp+eax-9C78h]  
100098BF  vaddps      xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]  
100098C8  vmulps      xmm0,xmm0,xmm1  
100098CC  vmovups     xmmword ptr [ebp+eax-9C78h],xmm0  
100098D5  add         eax,20h  
100098D8  cmp         eax,9C40h  
100098DD  jl          ComputeTempo+67h (10009897h)  


    const int   winpts = (int)(window_size*sr+0.5);
100098DF  vxorps      xmm1,xmm1,xmm1  
100098E3  vcvtsi2ss   xmm1,xmm1,ecx  

I have also verified that I can use ymm* registers to speed up my program further without crashing. I did that with the intrinsics from immintrin.h, e.g. _mm256_mul_ps.
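For reference, here is a minimal sketch of the kind of intrinsic code meant here, applied to the loop from the question. The function names (mul_add, mul_add_ymm, mul_add_scalar) and the use_avx flag are made up for illustration and are not from the original program. The caller is assumed to have checked already that both the CPU and the OS support AVX.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#define SKETCH_X86 1
#include <immintrin.h>  // declares __m256 and the _mm256_* intrinsics
#endif

// Scalar reference: the loop from the question.
static void mul_add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = (a[i] + b[i]) * b[i];
}

#ifdef SKETCH_X86
// GCC and Clang need AVX enabled for this one function; MSVC accepts the
// intrinsics regardless of the /arch switch.
#if defined(__GNUC__)
__attribute__((target("avx")))
#endif
static void mul_add_ymm(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // 8 floats fit in one 256-bit ymm register
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_mul_ps(_mm256_add_ps(va, vb), vb));
    }
    _mm256_zeroupper();  // clear upper ymm halves before returning to non-AVX code
    for (; i < n; ++i) c[i] = (a[i] + b[i]) * b[i];  // scalar tail for leftovers
}
#endif

// use_avx is an assumption of this sketch: the caller decides whether the
// 256-bit path may run. On non-x86 targets the scalar loop is always used.
static void mul_add(const float* a, const float* b, float* c, std::size_t n, bool use_avx) {
    assert(a != nullptr && b != nullptr && c != nullptr);
#ifdef SKETCH_X86
    if (use_avx) {
        mul_add_ymm(a, b, c, n);
        return;
    }
#endif
    (void)use_avx;
    mul_add_scalar(a, b, c, n);
}
```

Calling mul_add(a, b, c, 10000, true) on an AVX-capable machine runs the 256-bit path; passing false falls back to the plain loop, which is a convenient way to compare the two.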

Can any Microsoft compiler developer explain this? Could it be one of the reasons why Visual Studio generates slower code than gcc/g++?

=============edited==============

The reason turns out to be that running a 32-bit OS on a 32-bit machine differs from running a 32-bit OS on a 64-bit machine. In the latter case, some operating systems might not know that the ymm* registers exist and therefore do not preserve their upper halves across a context switch. If a program uses ymm* registers on such a system and a context switch occurs, the upper halves can be silently corrupted when another program also uses ymm* registers. Visual Studio is being conservative here.
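Whether the OS has enabled saving of the ymm state can be checked at run time, which is the usual guard before executing any 256-bit code. A sketch, assuming an x86 target; the helper names (cpuid_leaf1_ecx, read_xcr0, ymm_state_usable) are hypothetical. It reads CPUID leaf 1 (ECX bit 27 = OSXSAVE, bit 28 = AVX) and then XCR0 through XGETBV (bit 1 = xmm state, bit 2 = upper ymm halves).

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#define YMM_CHECK_X86 1
static unsigned cpuid_leaf1_ecx() {
    int regs[4];  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return static_cast<unsigned>(regs[2]);
}
static std::uint64_t read_xcr0() { return _xgetbv(0); }
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>      // __get_cpuid
#define YMM_CHECK_X86 1
static unsigned cpuid_leaf1_ecx() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);  // ecx stays 0 if leaf 1 is unavailable
    return ecx;
}
static std::uint64_t read_xcr0() {
    std::uint32_t lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif

// True only if the CPU has AVX and the OS has enabled saving of the
// xmm and ymm state in XCR0.
bool ymm_state_usable() {
#ifdef YMM_CHECK_X86
    const unsigned ecx = cpuid_leaf1_ecx();
    const bool osxsave = ((ecx >> 27) & 1u) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx = ((ecx >> 28) & 1u) != 0;      // CPU implements AVX
    if (!osxsave || !avx) return false;            // XGETBV would fault without OSXSAVE
    return (read_xcr0() & 0x6u) == 0x6u;           // bits 1 and 2 must both be set
#else
    return false;  // not an x86 build
#endif
}
```

If ymm_state_usable() returns false, a program should stay on the 128-bit or scalar path, because the upper halves of the ymm registers are not guaranteed to survive a context switch.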

2 answers
迷人小祖宗
Reply #2 · 2019-07-11 14:10

Yes, it was a 32-bit/64-bit problem. Compiling in x64 mode does not have the problem. However, my program has to be compiled in 32-bit mode because it is a plugin for a host that supports only 32-bit. Even so, it is contradictory that in 32-bit mode /arch:AVX2 does allow the compiler to use ymm* registers.

The Intel instruction reference at http://www.felixcloutier.com/x86/ADDPS.html says that "in 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15)." The manuals at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html also state that 32-bit programs can access ymm* registers on both 32-bit and 64-bit operating systems. The only restriction is that in 32-bit mode you cannot access xmm8-xmm15 or ymm8-ymm15, because the register-extension bits of the instruction encoding are not available there. That is why I can use intrinsic functions to access the ymm* registers manually without causing an illegal-instruction crash.

In conclusion, unless some CPUs that support AVX but not AVX2 have trouble accessing ymm* registers in 32-bit mode (which my test has already shown is not the case), the restriction described above is unnecessary. I still hope the Visual C++ compiler will be improved to offer this optimization, since many computers support AVX but not AVX2, and using ymm* registers can double the throughput of floating-point arithmetic.

Evening l夕情丶
Reply #3 · 2019-07-11 14:11

I made a text file vec.cpp

//vec.cpp
void foo(float *a, float *b, float *c) {
    for (int i = 0; i < 10000; i++) c[i] = (a[i] + b[i])*b[i];
}

Then I went to a command prompt with the Visual Studio 2015 x86 x64 toolchain enabled and ran

cl /c /O2 /arch:AVX /FA vec.cpp

In the generated file vec.asm I see

$LL4@foo:
    vmovups ymm0, YMMWORD PTR [rax-32]
    lea rax, QWORD PTR [rax+64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-96]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-96], ymm2
    vmovups ymm0, YMMWORD PTR [rax-64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-64]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-64], ymm2
    sub rdx, 1
    jne SHORT $LL4@foo
    vzeroupper

The problem is that you are compiling in 32-bit mode. Using the same function above but compiling in 32-bit mode I get

$LL4@foo:
    lea eax, DWORD PTR [ebx+esi]
    lea ecx, DWORD PTR [ecx+32]
    lea esi, DWORD PTR [esi+32]
    vmovups xmm1, XMMWORD PTR [esi-48]
    vaddps  xmm0, xmm1, XMMWORD PTR [ecx-32]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [edx+ecx-32], xmm0
    vmovups xmm1, XMMWORD PTR [esi-32]
    vaddps  xmm0, xmm1, XMMWORD PTR [eax]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [eax+edx], xmm0
    sub edi, 1
    jne SHORT $LL4@foo