Avoiding AVX-SSE (VEX) Transition Penalties

Posted 2019-04-16 16:19

Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode.

I would like to implement fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this is not practical, since it would increase the size of the code and could make it run slower because the CPU has to decode larger instructions.

I just want to use two ymm registers (and possibly zmm - the affordable processors supporting zmm are promised to be available this year) for fast memory copy.

The question is: how can I use the ymm registers but avoid the transition penalties?

Will the penalty occur when I use just the ymm8-ymm15 registers (not ymm0-ymm7)? SSE originally had eight 128-bit registers (xmm0-xmm7), but in 64-bit mode eight more (xmm8-xmm15) are also available to non-VEX-prefixed instructions. However, I have reviewed our 64-bit application and it only uses xmm0-xmm7, since it also has a 32-bit version with almost the same code. Does the penalty only occur when the CPU actually tries to use an xmm register that was previously used as ymm and has any of its upper 128 bits non-zero? Isn't it better to just zero the ymm registers that I have used after the fast memory copy? For example, if I have used a ymm register once to copy 32 bytes of memory, what is the fastest way to zero it? Is "vpxor ymm15, ymm15, ymm15" fast enough? (AFAIK, vpxor can be executed on any of the 3 ALU execution ports, p0/p1/p5, while vxorpd can only be executed on p5.) Wouldn't the time to zero it outweigh the gain of using it to copy just 32 bytes of memory?

Tags: avx sse vex
5 answers
祖国的老花朵
Reply #2 · 2019-04-16 16:57

In my experience the best way to avoid AVX-SSE (VEX) transition penalties is to let the compiler generate native code for the micro-architecture. For example, you can use SSE intrinsics alongside AVX intrinsics and compile with -march=native. My GCC 6.2 compiles such a program using VEX-encoded instructions throughout. If you look at the generated assembly you will find an extra v in front of every instruction translated from an SSE intrinsic. On the other hand, if you are in doubt you can put __asm__ __volatile__ ( "vzeroupper" : : : ); at every point of your program after using ymm registers, but you should be careful about it.
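A minimal sketch of what mixing the two intrinsic families can look like. The names sum8 and sum8_avx are made up for illustration, and the function-level target("avx") attribute (a GCC/Clang extension) stands in here for building the whole file with -march=native on an AVX machine. _mm256_zeroupper() is the intrinsic form of the vzeroupper instruction from the inline asm above.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Hypothetical helper. Inside a function compiled for AVX the compiler is
   free to emit VEX encodings for the _mm_* (SSE) intrinsics as well. */
__attribute__((target("avx")))
static float sum8_avx(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);              /* AVX: load 8 floats     */
    __m128 lo = _mm256_castps256_ps128(v);       /* lower 4 floats         */
    __m128 hi = _mm256_extractf128_ps(v, 1);     /* upper 4 floats         */
    __m128 s  = _mm_add_ps(lo, hi);              /* SSE intrinsic          */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      /* fold 4 lanes into 2    */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  /* fold 2 lanes into 1    */
    float r = _mm_cvtss_f32(s);
    _mm256_zeroupper();                          /* compiles to vzeroupper */
    return r;
}
#endif

/* Hypothetical wrapper: takes the AVX path only if the CPU reports AVX,
   otherwise falls back to a plain scalar loop. */
float sum8(const float *p)
{
    assert(p != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx"))
        return sum8_avx(p);
#endif
    float r = 0.0f;
    for (int i = 0; i < 8; i++)
        r += p[i];
    return r;
}
```

Disassembling the AVX function (for example with objdump -d) is one way to check which encodings your compiler actually picked.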

可以哭但决不认输i
Reply #3 · 2019-04-16 17:04

Another possibility is to use registers zmm16 - zmm31. These registers have no non-VEX counterpart. There is no state transition and no penalty for mixing zmm16 - zmm31 with non-VEX SSE code. These 512-bit registers are only available in 64-bit mode and only on processors with AVX512.

Anthone
Reply #4 · 2019-04-16 17:07

I have found an interesting note by Agner on an Intel forum at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/704023

It answers the question of what happens if I use just ymm8-ymm9 while the application uses xmm0-xmm7, that is, when the two sides use different registers.

Here is the quote.

I just made a few more experiments on a Haswell. It treats all vector registers as having a dirty upper half if just one ymm register has been touched. In other words, if you modify ymm1 then a non-VEX instruction writing to xmm2 will have a false dependence on the previous value of xmm2. Knights Landing has no such false dependence. Perhaps it is remembering the state of each register separately?

Hopefully, future Intel processors will either remember the state of each register separately, or at least treat zmm16-zmm31 separately so that they don't pollute xmm0-xmm15. Can you reveal something about this?

This post from 12/28/2016 was left without a reply.

There is also some interesting information about VZEROUPPER on Agner's blog at http://www.agner.org/optimize/blog/read.php?i=761

太酷不给撩
Reply #5 · 2019-04-16 17:10

The optimal solution is probably to recompile all the code with VEX prefixes. The VEX coded instructions are mostly the same size as the non-VEX versions of the same instructions because the non-VEX instructions carry a legacy of a lot of prefixes and escape codes (due to a long history of short-sighted patches in the instruction coding scheme). The VEX prefix combines all the old prefixes and escape codes into a single prefix of two or three bytes (four bytes for AVX512).

A VEX/non-VEX transition works in different ways on different processors (see Why is this SSE code 6 times slower without VZEROUPPER on Skylake?):

On older Intel processors: VZEROUPPER is needed for a clean transition between different internal states in the processor.

On Intel Skylake or later processors: VZEROUPPER is needed to avoid a false dependence of a non-VEX instruction on the upper part of the register.

On current AMD processors: A 256-bit register is treated as two 128-bit registers. VZEROUPPER is not needed, except for compatibility with Intel processors. The cost of VZEROUPPER is approximately 6 clock cycles.

The advantage of using VEX prefixes on all your instructions is that you avoid these transition costs on all processors. Your legacy code can probably benefit from some 256-bit operations here and there in the hot innermost loop.

The disadvantage of VEX prefixes is that the code is incompatible with old processors, so you might need to keep your old version around for running on old processors.
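One way to keep both versions in a single binary is to pick between them at run time. The following is only a sketch under assumptions: add_f32_sse, add_f32_vex and select_add_f32 are hypothetical names, the target("avx") attribute and __builtin_cpu_supports are GCC/Clang extensions, and the float-add kernel is just a stand-in for whatever your legacy code does.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_f32_fn)(float *, const float *, const float *, size_t);

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Legacy build of the kernel: 128-bit SSE intrinsics in a function that is
   compiled with the file's default (non-AVX) target. */
static void add_f32_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}

/* Same source, but the target attribute lets the compiler use VEX
   encodings for this one function. */
__attribute__((target("avx")))
static void add_f32_vex(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}

/* Hypothetical selector: call once at startup and cache the result. */
add_f32_fn select_add_f32(void)
{
    return __builtin_cpu_supports("avx") ? add_f32_vex : add_f32_sse;
}
#else
/* Portable fallback for other compilers or architectures. */
static void add_f32_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

add_f32_fn select_add_f32(void)
{
    return add_f32_scalar;
}
#endif
```

The caller stores the returned pointer and uses it for every call, so the CPU check is paid only once. In a real code base the two variants would more likely live in separate source files built with different compiler flags.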

等我变得足够好
Reply #6 · 2019-04-16 17:15

To avoid the penalties on all architectures you just need to issue vzeroall or vzeroupper after the part of your code that uses VEX-encoded instructions, prior to returning to the rest of the code that uses non-VEX instructions.

Issuing these instructions is considered good practice for all AVX-using routines anyway, and is cheap - except perhaps on Knights Landing, but I doubt you are using that architecture. Even if you are, the performance characteristics are quite different from the desktop/Xeon family, so you'll probably want a separate compile there anyway.

These are the only instructions that move from the dirty upper to the clean upper state. You can't simply zero out the specific registers that you've used, as the chip isn't tracking the dirty state on a register-by-register basis.

The cost of these vzero* instructions is a few cycles: so if whatever you are doing in AVX is worth it, it will generally be worth it to pay this small cost.
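For the memory copy from the question, that could look roughly like the sketch below. fast_copy and copy_avx are hypothetical names, the buffers are assumed not to overlap, and the 32-byte threshold is an arbitrary placeholder rather than a tuned value. The point is only that vzeroupper is issued once per call, after the last ymm instruction and before control goes back to code that may be non-VEX.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Hypothetical AVX copy loop, compiled for AVX via a GCC/Clang attribute. */
__attribute__((target("avx")))
static void copy_avx(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    /* One vzeroupper for the whole routine: the upper halves are clean
       again before any other code, such as the library memcpy below, runs. */
    _mm256_zeroupper();
    memcpy(dst + i, src + i, n - i);   /* remaining 0..31 bytes */
}
#endif

/* Hypothetical entry point; dst and src must not overlap. */
void fast_copy(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (n >= 32 && __builtin_cpu_supports("avx")) {
        copy_avx(dst, src, n);
        return;
    }
#endif
    memcpy(dst, src, n);               /* small sizes or no AVX */
}
```

Because the cost of vzeroupper is fixed per call, it matters most for very small copies, which is why the sketch sends anything under 32 bytes straight to memcpy.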
