How can I set __m128i without using any SSE instructions?

Posted 2019-04-30 18:36

Question:

I have many functions that use the same constant __m128i values. For example:

const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);

So I want to store all these constants in one place. But there is a problem: I check which CPU extensions are available at run time. If the CPU doesn't support, for example, SSE (or AVX), then the program will crash while the constants are being initialized.

So is it possible to initialize these constants without using SSE?

Answer 1:

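Initializing a __m128i vector without using SSE instructions is possible, but how to do it depends on how the compiler defines __m128i.

For Microsoft Visual Studio you can define the following macros (MSVC defines __m128i as a union whose first member is a 16-element char array, so a brace initializer fills it byte by byte):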

template <class T> inline char GetChar(T value, size_t index)
{
    return ((char*)&value)[index];
}

#define AS_CHAR(a) char(a)

#define AS_2CHARS(a) \
    GetChar(int16_t(a), 0), GetChar(int16_t(a), 1)

#define AS_4CHARS(a) \
    GetChar(int32_t(a), 0), GetChar(int32_t(a), 1), \
    GetChar(int32_t(a), 2), GetChar(int32_t(a), 3)

#define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
    {AS_CHAR(a0), AS_CHAR(a1), AS_CHAR(a2), AS_CHAR(a3), \
     AS_CHAR(a4), AS_CHAR(a5), AS_CHAR(a6), AS_CHAR(a7), \
     AS_CHAR(a8), AS_CHAR(a9), AS_CHAR(aa), AS_CHAR(ab), \
     AS_CHAR(ac), AS_CHAR(ad), AS_CHAR(ae), AS_CHAR(af)}

#define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
    {AS_2CHARS(a0), AS_2CHARS(a1), AS_2CHARS(a2), AS_2CHARS(a3), \
     AS_2CHARS(a4), AS_2CHARS(a5), AS_2CHARS(a6), AS_2CHARS(a7)}

#define _MM_SETR_EPI32(a0, a1, a2, a3) \
    {AS_4CHARS(a0), AS_4CHARS(a1), AS_4CHARS(a2), AS_4CHARS(a3)}       

For GCC it will be as follows (GCC defines __m128i as a vector of two long long values):

#define CHAR_AS_LONGLONG(a) (((long long)(a)) & 0xFF)

#define SHORT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFF)

#define INT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFFFFFF)

#define LL_SETR_EPI8(a, b, c, d, e, f, g, h) \
    CHAR_AS_LONGLONG(a) | (CHAR_AS_LONGLONG(b) << 8) | \
    (CHAR_AS_LONGLONG(c) << 16) | (CHAR_AS_LONGLONG(d) << 24) | \
    (CHAR_AS_LONGLONG(e) << 32) | (CHAR_AS_LONGLONG(f) << 40) | \
    (CHAR_AS_LONGLONG(g) << 48) | (CHAR_AS_LONGLONG(h) << 56)

#define LL_SETR_EPI16(a, b, c, d) \
    SHORT_AS_LONGLONG(a) | (SHORT_AS_LONGLONG(b) << 16) | \
    (SHORT_AS_LONGLONG(c) << 32) | (SHORT_AS_LONGLONG(d) << 48)

#define LL_SETR_EPI32(a, b) \
    INT_AS_LONGLONG(a) | (INT_AS_LONGLONG(b) << 32)        

#define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
    {LL_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7), LL_SETR_EPI8(a8, a9, aa, ab, ac, ad, ae, af)}

#define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
    {LL_SETR_EPI16(a0, a1, a2, a3), LL_SETR_EPI16(a4, a5, a6, a7)}

#define _MM_SETR_EPI32(a0, a1, a2, a3) \
    {LL_SETR_EPI32(a0, a1), LL_SETR_EPI32(a2, a3)}        

So in your code, the initialization of a __m128i constant will look like this:

const __m128i K8 = _MM_SETR_EPI8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _MM_SETR_EPI16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _MM_SETR_EPI32(1, 2, 3, 4);
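
As a rough cross-check of the GCC-style packing, here is a sketch that uses constexpr functions instead of macros. The names pack8 and layout_matches are made up for illustration, and the code assumes a little-endian target (which every x86 CPU is). It packs eight bytes into one 64-bit lane and confirms that two such lanes have the same memory layout as a plain 16-byte array. No SSE header is involved.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the answer): packs eight bytes into one
// 64-bit lane, lowest byte first, mirroring what LL_SETR_EPI8 does.
constexpr std::uint64_t pack8(std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d,
                              std::uint8_t e, std::uint8_t f, std::uint8_t g, std::uint8_t h)
{
    return  std::uint64_t(a)        | (std::uint64_t(b) << 8)  |
           (std::uint64_t(c) << 16) | (std::uint64_t(d) << 24) |
           (std::uint64_t(e) << 32) | (std::uint64_t(f) << 40) |
           (std::uint64_t(g) << 48) | (std::uint64_t(h) << 56);
}

// Evaluated at compile time, so no run-time code is needed to build the value.
static_assert(pack8(1, 2, 3, 4, 5, 6, 7, 8) == 0x0807060504030201ULL,
              "the first argument ends up in the lowest byte");

// Returns true if the two packed lanes hold the same bytes in memory as the
// array {1, ..., 16}. This only holds on a little-endian target.
inline bool layout_matches()
{
    const std::uint64_t lanes[2] = { pack8(1, 2, 3, 4, 5, 6, 7, 8),
                                     pack8(9, 10, 11, 12, 13, 14, 15, 16) };
    const std::uint8_t bytes[16] = { 1, 2, 3, 4, 5, 6, 7, 8,
                                     9, 10, 11, 12, 13, 14, 15, 16 };
    return std::memcmp(lanes, bytes, sizeof(bytes)) == 0;
}
```

Working on unsigned 64-bit values also sidesteps the question of shifting a byte with its top bit set into the sign bit of a long long.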


Answer 2:

I suggest defining the initialisation data globally as scalar data and then loading it locally into a const __m128i:

static const uint8_t gK8[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

static inline void foo()
{
    const __m128i K8 = _mm_loadu_si128((const __m128i *)gK8);

    // ...
}
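
A slightly fuller sketch of this idea, under a few assumptions: the function names and the has_sse2 flag are hypothetical, the flag stands in for whatever run-time CPU check the program already performs, and in a real build the SSE2 function would sit in a translation unit compiled with SSE2 enabled while the rest is built for the baseline CPU. The table is plain bytes, so initializing it executes no SSE instruction. The vector load happens only inside the function that is reached after the check.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Plain scalar data: statically initialized, no SSE needed at start-up.
// alignas(16) is what permits the aligned load below.
alignas(16) static const std::uint8_t gK8[16] =
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

// Baseline path: adds the constant to 16 bytes, one byte at a time.
static void add_k8_scalar(const std::uint8_t* src, std::uint8_t* dst)
{
    for (int i = 0; i < 16; ++i)
        dst[i] = static_cast<std::uint8_t>(src[i] + gK8[i]);
}

// SSE2 path: the constant becomes a vector only here.
static void add_k8_sse2(const std::uint8_t* src, std::uint8_t* dst)
{
    const __m128i K8 = _mm_load_si128(reinterpret_cast<const __m128i*>(gK8));
    const __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_add_epi8(v, K8));
}

// Hypothetical dispatcher: has_sse2 is the result of the caller's CPU check.
void add_k8(const std::uint8_t* src, std::uint8_t* dst, bool has_sse2)
{
    if (has_sse2)
        add_k8_sse2(src, dst);
    else
        add_k8_scalar(src, dst);
}
```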


Answer 3:

You can use a union.

union M128 {
   char i8[16];
   __m128i i128;
};

const M128 k8 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

If the M128 union is defined locally where you use the loop, this should have no performance overhead (it will be loaded from memory once at the beginning of the loop). Because it contains a member of type __m128i, M128 inherits the correct alignment.

void foo()
{
   M128 k8 = ...;
   // use k8.i128 in your for loop
}

If it is defined somewhere else, then you need to copy it into a local variable before you start the loop; otherwise the compiler may not be able to optimize it.

void foo()
{
    __m128i tmp = k8.i128;
    // for loop here
}

This will load k8 into a CPU register and keep it there for the duration of the loop, as long as there are enough free registers to carry out the loop body.

Depending on which compiler you use, such unions may already be defined (Visual Studio does this), but the compiler-provided definitions may not be portable.
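
A compilable sketch of the union approach, assuming an x86 target and a compiler that provides emmintrin.h. The helper name load_k8 is made up for illustration. Reading i128 after initializing i8 is type punning through a union: GCC documents this as supported, but ISO C++ does not guarantee it, so treat it as a compiler-specific technique.

```cpp
#include <cassert>
#include <emmintrin.h>

union M128 {
    char    i8[16];  // first member: this is what brace initialization fills
    __m128i i128;    // gives the union its 16-byte alignment
};

static_assert(sizeof(M128) == 16, "both members cover the same 16 bytes");
static_assert(alignof(M128) == 16, "alignment is inherited from __m128i");

// Initialized from plain byte values; no intrinsic appears in the initializer.
static const M128 k8 = {{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }};

// Hypothetical accessor: call it only after the CPU check has passed.
inline __m128i load_k8()
{
    return k8.i128;
}
```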



Answer 4:

You usually don't need this. Compilers are very good at using the same storage for multiple functions that use the same constant. Just like merging multiple instances of the same string literal into one string constant, multiple instances of the same _mm_set* in different functions will all load from the same vector constant (or generate on the fly for _mm_setzero_si128() or _mm_set1_epi8(-1)).

Using Godbolt's binary output (disassembly) mode lets you see whether different functions are loading from the same block of memory or not. Look at the comment it adds, which resolves the RIP-relative addresses to absolute addresses.

  • gcc: all identical constants share the same storage, regardless of whether they're from auto-vectorization or _mm_set. 32B constants can't overlap with 16B constants, even if the 16B constant is a subset of the 32B.

  • clang: identical constants share storage. 16B and 32B constants don't overlap, even when one is a subset of the other. Some functions using repetitive constants use an AVX2 vpbroadcastd broadcast-load (which doesn't even take an ALU uop on Intel SnB-family CPUs). For some reason, it chooses to do this based on the element size of the operation, not the repetitivity of the constant. Note that clang's asm output repeats the constant for each use, but the final binary doesn't.

  • MSVC: identical constants share storage. Pretty much the same as what gcc does. (The full asm output is hard to wade through; use search. I could only get the asm at all by having main find the path to the .exe, then work out the path to the asm output made with cl.exe -O2 /FAs, and run system("type .../foo.asm")).

Compilers are good at this, since it's not a new problem. It has existed with strings since the earliest days of compilers.
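
To try this yourself, a small test case along these lines can be pasted into a compiler explorer. The function names are made up for illustration. Going by the description above, both functions would be expected to reference one copy of the constant, but that is something to confirm in the disassembly of your own compiler and flags rather than take on trust.

```cpp
#include <cassert>
#include <emmintrin.h>

// Two unrelated functions that each spell out the same vector constant.
__m128i add_k32(__m128i v)
{
    return _mm_add_epi32(v, _mm_setr_epi32(1, 2, 3, 4));
}

__m128i xor_k32(__m128i v)
{
    return _mm_xor_si128(v, _mm_setr_epi32(1, 2, 3, 4));
}
```

In the binary output, compare the addresses that the two functions load their constant from.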

I haven't checked if this works across source files (e.g. for an inline vector function used in multiple compilation units). If you do still want static / global vector constants, see below:


It appears there is no easy and portable way to statically initialize a static/global __m128. C compilers won't even accept _mm_set* as an initializer, because it works like a function. They don't take advantage of the fact that they could actually see through it to a compile-time-constant 16B.

const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);   // Illegal in C
// C++: generates a constructor that copies from .rodata to the BSS

Even though the constructor only requires SSE1 or SSE2, you don't want this anyway. It's horrible. DON'T DO THIS. You end up paying the memory cost of your constants twice.


Fabio's union answer looks like the best portable way to statically initialize a vector constant, but it means you have to access the __m128i union member. It may help with grouping related constants near each other (hopefully in the same cache line) even if they're used by scattered functions. There are non-portable ways to accomplish that, too (e.g. put related constants in their own ELF section with GNU C __attribute__ ((section ("constants_for_task_A")))). Hopefully that can group them together in the .rodata section (which traditionally becomes part of the text segment).
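
A minimal sketch of the section idea, assuming GCC or clang on an ELF target such as Linux. The variable and accessor names are hypothetical, and the constants are kept as plain arrays so that no intrinsic is needed to initialize them. Whether the linker ends up placing this section next to .rodata depends on the linker script, so check the resulting binary.

```cpp
#include <cassert>
#include <cstdint>

// GNU C extension: both tables are emitted into the same named section,
// which keeps them next to each other in the object file.
alignas(16) static const std::uint8_t kTaskA_K8[16]
    __attribute__((section("constants_for_task_A"))) =
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

alignas(16) static const std::uint32_t kTaskA_K32[4]
    __attribute__((section("constants_for_task_A"))) = { 1, 2, 3, 4 };

// Hypothetical accessors, so that other code can load the tables into
// vectors once the run-time CPU check has passed.
const std::uint8_t*  task_a_k8()  { return kTaskA_K8; }
const std::uint32_t* task_a_k32() { return kTaskA_K32; }
```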