Convention for displaying vector registers

2020-02-02 00:51发布

问题:

Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set?

For example, if you have 1 in the least significant byte, and 20 in the most significant byte, and 0 elsewhere in an xmm register, for a byte-wise display is the following preferred (little-endian):

[1, 0, 0, 0, ..., 0, 20]

or is this preferred:

[20, 0, 0, 0, ..., 0, 1]

Similarly, when displaying such registers as made up of larger data items, is the same rule applied? E.g., to display the register as DWORDs, I assume each DWORD is still written in the usual (big-endian) way, but what is the order of the DWORDS:

[0x1, 0x0, ..., 0x14]

vs

[0x14, 0x0, ..., 0x1]

Discussion

I think the two most promising answers are simply "LSE1 first" (i.e., the first output in the examples above) or "MSE first" (the second output). Neither depends on the endianness of the platform, as indeed once in a register data is generally endian independent (just like operations on a GP register or a long or int or whatever in C are independent of endianness). Endianness comes up in the register <-> memory interface, and here I'm asking about data already in a register.

It is possible that other answers exist, such as output that depends on endianness (and Paul R's answer may be one, but I can't tell).

LSE First

One advantage of LSE-first seems to be especially with byte-wise output: often the bytes are numbered from 0 to N, with the LSB being zero2, so LSB-first output outputs it with increasing indexes, much like you'd output an array of bytes of size N.

It's also nice on little endian architectures since the output then matches the in-memory representation of the same vector stored to memory.

MSE First

The main advantage here seems to be that the output for smaller elements is in the same order as for larger sizes (only with different grouping). For example, for a 4-byte vector in MSB notation [0x4, 0x3, 0x2, 0x1], the output for byte elements, word and dword elements would be:

[0x4, 0x3, 0x2, 0x1] [ 0x0403, 0x0201 ] [ 0x04030201 ]

Essentially, even from the byte output you can just "read off" the word or dword output, or vice-versa, since the bytes are already in the usual MSB-first order for number display. On the other hand, the corresponding output for LSE-first is:

[0x1, 0x2, 0x3, 0x4] [ 0x0201 , 0x0403 ] [ 0x04030201 ]

Note that each layer undergoes swaps relative to the row above it, so it's much harder to read off larger or smaller values. You'd need to rely more on outputting the element that is the most natural for your problem.

This format also has the advantage that on BE architectures the output then matches the in-memory representation of the same vector stored to memory3.

Intel uses MSE first in its manuals.


1 Least Significant Element

2 Such numberings are not just for documentation purposes - they are architecturally visible, e.g., in shuffle masks.

3 Of course this advantage is minuscule compared to the corresponding advantage of LSE-first on LE platforms since BE is almost dead in commodity SIMD hardware.

回答1:

Being consistent is the most important thing; If I'm working on existing code that already has LSE-first comments or variable names, I match that.

Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles or especially packing/unpacking to different element sizes.

Intel uses MSE-first not only in their diagrams in manuals, but in the naming of intrinsics/instructions like pslldq (byte shift) and psrlw (bit-shift): a left bit/byte shift goes towards the MSB. LSE-first thinking doesn't save you from mentally reversing things, it means you have to do it when thinking about shifts instead of loads/stores. Since x86 is little-endian, you sometimes have to be thinking about this anyway.


In MSE-first thinking about vectors, just remember that memory order is right to left. When you need to think about overlapping unaligned loads from a block of memory, you can draw the memory contents in right-to-left order, so you can look at vector-length windows of it.

In a text editor, it's no problem to add new text at the left hand side of something and have the existing text displaced to the right, so adding more elements to a comment isn't a problem.

Two major downsides to MSE-first notation are:

  • harder to type the alphabet backwards (like h g f e | d c b a for an AVX vector of 32-bit elements), so I sometimes just start from the right and type a, left-arrow, b, space, ctrl-left arrow, c, space, ... or something like that.

  • Opposite from C array-initializer order. Normally not a problem, because _mm_set_epi* uses MSE-first order. (Use _mm_setr_epi* to match LSE-first comments).


An example where MSE-first is nice is when trying to design a lane-crossing version of 256b vpalignr: See my answer on that question How to concatenate two vector efficiently using AVX2?. That includes design-notes in MSE-first notation.

As another example, consider implementing a variable-count byte-shift across a whole vector. You could make a table of pshufb control vectors, but that would be a massive waste of cache footprint. Much better to load a sliding window from memory:

/*  Example of using MSE notation for memory as well as vectors

// 4-element vectors to keep the design notes compact
// I started by just writing down a couple rows of this, then noticing which way they lined up
<< 3:                       00 FF FF FF
<< 1:                 02 01 00 FF
   0:              03 02 01 00
>> 2:        FF FF 03 02
>> 3:     FF FF FF 03
>> 4:  FF FF FF FF

       FF FF FF FF 03 02 01 00 FF FF FF FF
  highest address                       lowest address
*/

#include <immintrin.h>
#include <stdint.h>
// positive counts are right shifts, negative counts are left
// a left-only or right-only implementation would only have one side of the table,
// and only need 32B alignment for the constant in memory to prevent cache-line splits.
__m128i vshift(__m128i v, intptr_t bytes_right)
{   // intptr_t means the caller has to sign-extend it to the width of a pointer, saving a movsx in the non-inline version

   // C11 uses _Alignas, C++11 uses alignas
    _Alignas(64) static const int32_t shuffles[] = { 
        -1, -1, -1, -1,
        0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
        -1, -1, -1, -1
    };  // compact but messy with a mix of ordering :/
    const char *identity_shuffle = 16 + (const char*)shuffles;  // points to the middle 16B

    //  count &= 0xf;  tricky to efficiently limit the count while still allowing >>16 to zero the vector, and to allow negative.
    __m128i control = _mm_load_si128((const __m128i*) (identity_shuffle + bytes_right));
    return _mm_shuffle_epi8(v, control);
}

This is kind of the worst-case for MSE-first, because right-shifts take a window from farther left. In LSE-first notation, it might look more natural. Still, unless I got something backwards :P, I think it shows that you can successfully use MSE-first notation even for something you'd expect to be tricky. It didn't feel mind-bending or over-complicated. I just started writing down shuffle control vectors and then lined them up. I could have made it slightly simpler when translating to a C array if I'd used uint8_t shuffles[] = { 0xff, 0xff, ..., 0, 1, 2, ..., 0xff };. I haven't tested this, only that it compiles to one instruction:

    vpshufb xmm0, xmm0, xmmword ptr [rdi + vshift.shuffles+16]
    ret

MSE lets you notice more easily when you can use a bit-shift instead of a shuffle instruction, to reduce pressure on port 5. e.g. psllq xmm, 16/_mm_slli_epi64(v,16) to shift word elements left by one (with zeroing at qword boundaries). Or when you need to shift byte elements, but the only available shifts are 16-bit or wider. The narrowest variable-per-element shifts are 32-bit elements (vpsllvd).

MSE makes it easy to get the shuffle constant right when using larger or smaller granularity shuffles or blends, e.g. pshufd when you can keep pairs of word elements together, or pshufb to shuffle words across the whole vector (because pshuflw/hw is limited).

_MM_SHUFFLE(d,c,b,a) goes in MSE order, too. So does any other way of writing it as a single integer, like C++14 0b11'10'01'00 or 0xE4 (the identity shuffle). Using LSE-first notation will make your shuffle constants look "backwards" relative to your comments. (except for pshufb constants, which you can write with _mm_setr)



回答2:

My rule of thumb is: match the equivalent layout in memory, so if you have 0x1 0x2 0x3 ... 0xf in memory, and you load it to a vector register, then displaying the contents of the vector register should also look like 0x1 0x2 0x3 ... 0xf.

If you use the %v format extensions for printf that are supported by some compilers (e.g. Apple's gcc and clang) then this is the behaviour that you get, and I find it helpful, as you can almost forget about the vagaries of little endianness, e.g.

#include <stdio.h>
#include <stdint.h>
#include <xmmintrin.h>

int main(void)
{
    uint8_t a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

    __m128i v = _mm_loadu_si128((__m128i *)a);

    printf("v = %#vx\n", v);
    printf("v = %#vhx\n", v);
    printf("v = %#vlx\n", v);

    return 0;
}

With a suitable compiler this gives:

v = 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf 0x10
v = 0x201 0x403 0x605 0x807 0xa09 0xc0b 0xe0d 0x100f
v = 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d


标签: x86 sse simd avx