Are static / static local SSE / AVX variables blocking an xmm register?

Posted 2020-06-04 04:33

Question:

When using SSE intrinsics, zero vectors are often required. One way to avoid creating a zero variable inside a function on every call (each call effectively executing some vector xor instruction) would be to use a static local variable, as in

static inline __m128i negate(__m128i a)
{
   static __m128i zero = _mm_setzero_si128();
   return _mm_sub_epi16(zero, a);
}

It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a true function instead of the _mm_setzero_si128() intrinsic. It only seems to be possible in C++, not in C, by the way.)

(1) However, once this initialization has happened: Does this block an xmm register for the rest of the program?

(2) Even worse: If such a static local variable is used in multiple functions, would it block multiple xmm registers?

(3) The other way round: If it is not blocking an xmm register, would the zero variable always be reloaded from memory when the function is called? Then the static local variable would be pointless, since it would be faster to use _mm_setzero_si128().

As an alternative, I was thinking about putting zero into a global static variable that would be initialized at program start:

static __m128i zero = _mm_setzero_si128();

(4) Would the global variable stay in an xmm register while the program runs?

Thanks a lot for your help!

(Since this also applies to AVX intrinsics, I also added the AVX tag.)

Answer 1:

Answering the question that should really be asked here: you should not be worrying about this at all. Zeroing a register via xor effectively costs nothing at all most of the time. Modern x86 processors recognize this idiom and handle the zeroing directly in register rename; no µop needs to issue at all. The only time this can slow you down is if you are bound by the front-end, but that is a rather rare situation to be in.

While variations on these questions might be worth pondering in other circumstances (and Mystical's comment gives some good leads on how to answer them yourself), you should really just use setzero and call it a day.



Answer 2:

For this particular operation, you should do as Stephen Canon says and write

static inline Vec8s operator - (Vec8s const & a) {
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}

That's taken directly from Agner Fog's Vector Class Library.
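
Since Vec8s is a VCL type, a self-contained version of the same idea with a plain __m128i might look like the sketch below. The names negate_epi16 and negate8 are made up for illustration and are not part of VCL; the code assumes an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Negate eight signed 16-bit lanes by computing 0 - a.
// The zero vector is created inline; no static variable is involved.
static inline __m128i negate_epi16(__m128i a) {
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}

// Hypothetical helper (illustration only): negate an array of eight
// int16_t values using unaligned loads and stores.
static inline void negate8(const std::int16_t* in, std::int16_t* out) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), negate_epi16(v));
}
```

Calling negate8 on eight values writes their negations to the output array, which makes it easy to check the result lane by lane.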

But let's consider what the static keyword does. When you declare a variable using static it uses static storage. This places it in the data section (which includes the .bss section) of your object file.

#include <x86intrin.h>
extern "C" void foo2(__m128i a);

static const __m128i zero = _mm_setzero_si128();

static inline __m128i negate(__m128i a) {
    return _mm_sub_epi16(zero, a);
}

extern "C" void foo(__m128i a, __m128i b) {
    foo2(negate(a));
}

I compile with g++ -O3 -c static.cpp and then look at the disassembly and sections. There is a .bss section with a label _ZL4zero, and a startup code section that writes the static variable in the .bss section.

.text.startup
    pxor    xmm0, xmm0
    movaps  XMMWORD PTR _ZL4zero[rip], xmm0
    ret

The foo function

    movdqa  xmm1, XMMWORD PTR _ZL4zero[rip]
    psubw   xmm1, xmm0
    movdqa  xmm0, xmm1

So GCC never keeps the static variable in an XMM register. It reads it from memory in the data section.

What if we did _mm_sub_epi16(_mm_setzero_si128(),a)? Then GCC produces for foo

    pxor    xmm1, xmm1
    psubw   xmm1, xmm0
    movdqa  xmm0, xmm1

On Intel processors since Sandy Bridge the pxor is "free". On processors before that it's almost free. So this is clearly a better solution than reading from memory.

What if we tried _mm_sub_epi16(_mm_set1_epi32(-1), a)? In that case GCC produces

    pcmpeqd xmm1, xmm1
    psubw   xmm1, xmm0
    movdqa  xmm0, xmm1

The pcmpeqd instruction is not free on any processor, but it's still better than reading from memory using movdqa. Okay, so 0 and -1 are special. What about _mm_sub_epi16(_mm_set1_epi32(1), a)? In this case GCC produces for foo

    movdqa  xmm1, XMMWORD PTR .LC0[rip]
    psubw   xmm1, xmm0
    movdqa  xmm0, xmm1

That's essentially the same as using a static variable! When I look at the sections I see that .LC0 points to a read only data section (.rodata).

Edit: here is a way to get GCC to keep a global variable in a register.

register __m128i zero asm ("xmm15") = _mm_set1_epi32(1);

This produces

movdqa  xmm2, xmm15
psubw   xmm2, xmm0
movdqa  xmm0, xmm2


Answer 3:

Since you use vectors for efficiency, your code has a problem.

A static variable that isn't initialised with a constant will be initialised at runtime, in a thread-safe way. The first time your inline function is called, the static variable is initialised. On every single call after that, a check is made whether the static variable needs initialising or not.

So on every call, there is a check, then there is a load from memory. If you don't use a static variable, there's probably a single instruction creating the value, plus plenty of opportunity for optimisation. Loading from memory is slow.

And you can have as many static variables as you like. The compiler will handle anything you throw at it.
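
The first-call initialisation described in this answer can be observed with a small experiment, similar to the one the question mentions (calling a true function instead of the intrinsic). This is only a sketch: make_zero, g_init_calls, negate_static and lane0 are hypothetical names introduced for illustration, and the code assumes an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Counts how often the initialiser runs (illustration only).
static int g_init_calls = 0;

// Hypothetical "true function" standing in for _mm_setzero_si128().
static __m128i make_zero() {
    ++g_init_calls;
    return _mm_setzero_si128();
}

static inline __m128i negate_static(__m128i a) {
    // Dynamic initialiser: it runs once, on the first call. Later calls
    // only test the guard and then read zero from memory.
    static const __m128i zero = make_zero();
    return _mm_sub_epi16(zero, a);
}

// Returns lane 0 of v as a signed 16-bit value.
static inline int lane0(__m128i v) {
    return static_cast<short>(_mm_extract_epi16(v, 0));
}
```

After any number of calls to negate_static, g_init_calls stays at 1, which matches the behaviour described above: the initialiser runs once, and the cost on later calls is the guard check plus the load.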



Answer 4:

I think I can add an interesting point to the discussion, particularly to my comment on _mm_abs_ps(). If I define

static inline __m128 _mm_abs_ps_2(__m128 x) {
  __m128 signMask = _mm_set1_ps(-0.0F);
  return _mm_andnot_ps(signMask, x);
}

(Agner Fog's VCL http://www.agner.org/optimize/#vectorclass uses an integer set1, a cast, and an AND operation instead, but that should in effect be the same) and use the function in a loop

float *p = data;
for (int i = 0; i < LEN; i += 4, p += 4)
  _mm_store_ps(p, _mm_abs_ps_2(_mm_load_ps(p)));

then gcc (4.6.3, -O3) is clever enough to avoid repeatedly executing _mm_set1_ps by moving it outside the loop:

    vmovaps xmm1, XMMWORD PTR .LC1[rip] # tmp108,
    mov rax, rsp    # p,
.L3:
    vandnps xmm0, xmm1, XMMWORD PTR [rax]   # tmp102, tmp108, MEM[base: p_54, offset: 0B]
    vmovaps XMMWORD PTR [rax], xmm0 # MEM[base: p_54, offset: 0B], tmp102
    add rax, 16 # p,
    cmp rax, rbp    # p, D.7371
    jne .L3 #,
.LC1:
    .long   2147483648
    .long   2147483648
    .long   2147483648
    .long   2147483648

So, probably in most cases one shouldn't worry at all about repeatedly setting some xmm register to a constant inside some function.
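
For comparison, the two ways of clearing the sign bit mentioned in this answer (ANDNOT with a -0.0f mask, or an integer set1, a cast, and an AND) could be written side by side as below. This is a sketch of the idea, not VCL's actual source; abs_ps_andnot, abs_ps_and and abs4 are hypothetical names, and an x86 target with SSE2 is assumed.

```cpp
#include <emmintrin.h>  // SSE2 (also provides the SSE float intrinsics)

// ANDNOT variant: the mask has only the sign bit set in each lane,
// and (~mask) & x clears that bit.
static inline __m128 abs_ps_andnot(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(sign_mask, x);
}

// Integer-mask variant: build 0x7FFFFFFF per lane as integers,
// reinterpret the bits as floats, and AND to keep all but the sign bit.
static inline __m128 abs_ps_and(__m128 x) {
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_and_ps(x, mask);
}

// Hypothetical helper (illustration only): apply one of the variants
// to four floats using unaligned loads and stores.
static inline void abs4(const float* in, float* out, bool use_and) {
    const __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, use_and ? abs_ps_and(v) : abs_ps_andnot(v));
}
```

Both variants build their constant inside the function, so the compiler remains free to hoist it out of a surrounding loop as in the listing above.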



Tags: c++ sse avx