I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant.
The x86 architecture has accumulated a lot of math/multimedia extensions over decades:
- MMX
- 3DNow!
- SSE
- SSE2
- SSE3
- SSSE3
- SSE4
- AVX
- AVX2
- AVX512
- Did I forget something?
Are the newer ones supersets of the older ones and vice versa? Or are they complementary?
Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE".
Are some of them mutually exclusive? I.e. do they share the same hardware parts?
Which should I use together to maximize hardware utilization on modern Intel / AMD CPUs? For sake of argument, let's assume I can find appropriate uses for the instructions... heating my house with the CPU if nothing else.
I recently updated the tag wikis for SSE, AVX, and x86 (and SSE2, avx2). They cover a lot of this. tl;dr summary: AVX rolls up all the previous SSE versions, and provides 3-operand versions of those instructions. Also 256b versions of most FP (AVX) and int (AVX2) insns.
For summaries of the various SSE versions, see wikipedia, or knm241's more-detailed answer.
We don't really think of that as making SSE obsolete. More like, think of AVX as a new and better version of the same old SSE instructions. They're still in the ref manual under their non-AVX names (`PSHUFB`, not `VPSHUFB`, for example). You can mix AVX and SSE code, as long as you use `VZEROUPPER` when needed to avoid the performance problem from mixing VEX with non-VEX insns (on Intel). So there is some annoyance in dealing with cases where you have to call into libraries that might run non-VEX SSE instructions, or where your code uses SSE FP math, but also has some AVX code to be run only if the CPU supports it.
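As a rough illustration of the "AVX code to be run only if the CPU supports it" case, here is a minimal sketch in C. It assumes GCC or Clang on x86-64 (for `__builtin_cpu_supports` and the `target` function attribute); the names `sum_sse`, `sum_avx`, and `sum_floats` are made up for this example.

```c
#include <immintrin.h>
#include <stddef.h>

/* Baseline path: 128-bit SSE, always available on x86-64. */
static float sum_sse(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)          /* scalar tail */
        s += a[i];
    return s;
}

/* AVX path: compiled with AVX enabled for this one function only. */
__attribute__((target("avx")))
static float sum_avx(const float *a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += lanes[k];
    for (; i < n; i++)          /* scalar tail */
        s += a[i];
    /* Clear the upper YMM halves before returning to code that may
       run non-VEX SSE instructions. */
    _mm256_zeroupper();
    return s;
}

/* Pick a path at run time based on what the CPU reports. */
float sum_floats(const float *a, size_t n) {
    if (__builtin_cpu_supports("avx"))
        return sum_avx(a, n);
    return sum_sse(a, n);
}
```

The two paths add the elements in a different order, so results can differ in the last bits for general floating-point input.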
If CPU-compatibility was a non-issue, the legacy-SSE versions of vector instructions would be truly obsolete, like MMX is now. AVX/AVX2 is at least slightly better in every way, if you count the VEX-encoded 128b version of an insn as AVX, not SSE. Sometimes you'd still use 128b registers because your data only comes in chunks that big, but more often you'd work with 256b registers to do the same op on twice as much data at once.
SSE/AVX/x87-FP/integer instructions all use the same execution ports. You can't get more done in parallel by mixing them. (except on Haswell, where one of the 4 ALU ports can only handle non-vector insns, like GP reg ops and branches).
They are complementary.
Each new instruction set extension adds new instructions and sometimes a new programming model (new registers, for example).
None are deprecated; deprecating instructions is almost impossible for compatibility reasons. However, some optional extensions may be absent from or removed in newer models (like AMD's FMA4) if they are not widespread.
Some are vestigial though: almost everything that can be done with the FPU and MMX, for example, can be done more efficiently with SSE+ (80-bit extended precision on the x87 FPU being the notable exception).
They are not mutually exclusive: you can freely use one or another, since they are instructions, not modes of operation (like real vs. protected mode, for example).
The only possible "conflict" is between MMX and the FPU, as the MMX registers alias the low 64 bits of the x87 registers but have a different programming model.
The vector registers have grown from 128 bits to 256 bits and then to 512 bits; each time, the previous registers have become the low part of the newer ones.
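The "low part" relationship is visible even at the intrinsics level. The sketch below (assuming GCC or Clang on x86-64; the function names are hypothetical) loads eight floats into a 256-bit value and then views its low 128 bits with `_mm256_castps256_ps128`, a cast that is not expected to generate any instruction.

```c
#include <immintrin.h>
#include <string.h>

/* Compiled with AVX enabled for this function only. */
__attribute__((target("avx")))
static int low_half_check_avx(void) {
    float in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float lo[4];
    __m256 y = _mm256_loadu_ps(in);          /* a full 256-bit YMM value */
    __m128 x = _mm256_castps256_ps128(y);    /* its low 128 bits as XMM  */
    _mm_storeu_ps(lo, x);
    _mm256_zeroupper();
    return memcmp(lo, in, sizeof lo) == 0;   /* 1 if lo == in[0..3] */
}

/* Returns 1 if the XMM view equals the low half, -1 if AVX is unavailable. */
int xmm_is_low_half_of_ymm(void) {
    if (!__builtin_cpu_supports("avx"))
        return -1;
    return low_half_check_avx();
}
```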
You can use all of them together; they offer specific hardware support for simple operations.
They are like Lego bricks, you are only limited by your imagination (or the imagination of the designers).
Here is a simple list of these instruction set extensions.
Only some features are listed, for the complete reference see Intel Manual Vol1 from chapter 9 to 14.
See also https://hjlebbink.github.io/x86doc/ for a table of contents of Intel's volume 2 (instruction set reference) manual, with a list of extensions that added instructions to that manual entry.
MMX
Introduce eight 64-bit registers (MM0-MM7) and instructions to work with eight signed/unsigned bytes, four signed/unsigned words, or two signed/unsigned dwords.
3DNow!
Add support for single-precision floating-point operands in MMX registers. Few operations are supported, for example addition, subtraction, multiplication.
SSE
Introduce eight/sixteen 128-bit registers (XMM0-XMM7/15; eight in 32-bit mode, sixteen in 64-bit mode) and instructions to work with four single-precision floating-point operands. Add integer operations on MMX registers too. (The MMX-integer part of SSE is sometimes called MMXEXT, and was implemented on a few non-Intel CPUs without xmm registers and the floating-point part of SSE.)
SSE2
Introduce instructions to work with two double-precision floating-point operands, and with packed byte/word/dword/qword integers in 128-bit xmm registers.
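For a feel of what this looks like from C, here is a small sketch using SSE2 intrinsics from `<emmintrin.h>`. It assumes an x86-64 target, where SSE2 is part of the baseline; the function names are made up for illustration.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Add four packed 32-bit integers in one 128-bit operation (PADDD). */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Add two packed doubles in one 128-bit operation (ADDPD). */
void add2_f64(const double *a, const double *b, double *out) {
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```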
SSE3
Add a few varied instructions (mostly floating point), including a special kind of unaligned load (`lddqu`) that was better on Pentium 4, synchronization instructions, and horizontal add/sub.
SSSE3
Again a varied set of instructions, mostly integer. The first shuffle that takes its control operand from a register instead of hard-coded (`pshufb`). More horizontal processing, shuffle, packing/unpacking, mul+add on bytes, and some specialized integer add/mul stuff.
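To show what a register-controlled shuffle buys you, here is a minimal sketch that reverses 16 bytes with `_mm_shuffle_epi8` (the intrinsic for `pshufb`). It assumes GCC or Clang on x86-64; `reverse16` and its helper are hypothetical names, and the scalar loop is only a fallback for CPUs without SSSE3.

```c
#include <immintrin.h>
#include <stdint.h>

/* Compiled with SSSE3 enabled for this function only. */
__attribute__((target("ssse3")))
static void reverse16_ssse3(const uint8_t *in, uint8_t *out) {
    /* Control vector: output byte i takes input byte ctrl[i]. */
    const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, ctrl));
}

/* Reverse 16 bytes; in and out must not overlap. */
void reverse16(const uint8_t *in, uint8_t *out) {
    if (__builtin_cpu_supports("ssse3")) {
        reverse16_ssse3(in, out);
        return;
    }
    for (int i = 0; i < 16; i++)   /* scalar fallback */
        out[i] = in[15 - i];
}
```

Because the control vector is ordinary data, the same instruction can implement any byte permutation, or a 16-entry lookup table, chosen at run time.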
SSE4 (SSE4.1, SSE4.2)
Add a lot of instructions, filling in many of the gaps by providing min, max, and other operations for all integer data types (32-bit integers especially had been lacking); previously, integer min was only available for unsigned bytes and signed 16-bit. Also scaling, FP rounding, blending, linear algebra operations, text processing, comparisons. Also a non-temporal load for reading video memory, or copying it back to main memory. (Previously only NT stores were available.)
AESNI
Add support for accelerating AES symmetric encryption/decryption.
AVX
Add eight/sixteen 256-bit registers (YMM0-YMM7/15).
Support all previous floating-point data types. Add three-operand forms of the existing instructions.
FMA
Add fused multiply-add and related instructions.
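A fused multiply-add computes `a * b + c` with a single rounding. The sketch below (assuming GCC or Clang on x86-64; `fma8` and its helper are names invented here) uses `_mm256_fmadd_ps` when the CPU reports FMA support and falls back to a plain loop otherwise.

```c
#include <immintrin.h>

/* Compiled with AVX and FMA enabled for this function only. */
__attribute__((target("avx,fma")))
static void fma8_hw(const float *a, const float *b, const float *c,
                    float *out) {
    __m256 r = _mm256_fmadd_ps(_mm256_loadu_ps(a),
                               _mm256_loadu_ps(b),
                               _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
    _mm256_zeroupper();
}

/* out[i] = a[i] * b[i] + c[i] for eight floats. */
void fma8(const float *a, const float *b, const float *c, float *out) {
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma")) {
        fma8_hw(a, b, c, out);
        return;
    }
    /* Fallback: may round twice, so results can differ in the last bit. */
    for (int i = 0; i < 8; i++)
        out[i] = a[i] * b[i] + c[i];
}
```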
AVX2
Extend most integer instructions to the 256-bit YMM registers.
AVX512F
Add eight/thirty-two 512-bit registers (ZMM0-ZMM7/31) and eight mask registers (k0-k7; AVX512F itself uses 16 bits of each, and AVX512BW extends them to 64 bits). Promote most previous instructions to 512 bits wide. Optional parts of AVX512 add instructions for exponentials & reciprocals (AVX512ER), scatter/gather prefetching (AVX512PF), scatter conflict detection (AVX512CD), compress, expand.
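The mask registers let an instruction update only selected lanes. As a hedged sketch (GCC or Clang on x86-64 assumed; `masked_add16` and its helper are hypothetical names), the code below adds `b` to `a` only in the lanes whose mask bit is set, using `_mm512_mask_add_epi32`, with a scalar fallback for CPUs without AVX512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Compiled with AVX512F enabled for this function only. */
__attribute__((target("avx512f")))
static void masked_add16_hw(const int32_t *a, const int32_t *b,
                            int32_t *out, uint16_t mask) {
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    /* Lanes with a set mask bit get a + b; the others keep va. */
    __m512i r = _mm512_mask_add_epi32(va, (__mmask16)mask, va, vb);
    _mm512_storeu_si512(out, r);
}

/* out[i] = a[i] + b[i] if bit i of mask is set, else a[i] (16 lanes). */
void masked_add16(const int32_t *a, const int32_t *b,
                  int32_t *out, uint16_t mask) {
    if (__builtin_cpu_supports("avx512f")) {
        masked_add16_hw(a, b, out, mask);
        return;
    }
    for (int i = 0; i < 16; i++)   /* scalar fallback */
        out[i] = ((mask >> i) & 1) ? a[i] + b[i] : a[i];
}
```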
IMCI (Intel Xeon Phi)
Early development of AVX512 for the first-gen Intel Xeon Phi (Knights Corner) coprocessor.