When using SSE intrinsics, zero vectors are often required. One way to avoid creating a zero variable inside a function every time it is called (each call effectively executing some vector xor instruction) would be to use a static local variable, as in
static inline __m128i negate(__m128i a)
{
static __m128i zero = _mm_setzero_si128();
return _mm_sub_epi16(zero, a);
}
It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a true function instead of the _mm_setzero_si128() intrinsic. It only seems to be possible in C++, not in C, by the way.)
(1) However, once this initialization has happened: Does this block an xmm register for the rest of the program?
(2) Even worse: If such a static local variable is used in multiple functions, would it block multiple xmm registers?
(3) The other way round: If it is not blocking an xmm register, would the zero variable be reloaded from memory every time the function is called? Then the static local variable would be pointless, since it would be faster to use _mm_setzero_si128().
As an alternative, I was thinking about putting zero into a global static variable that would be initialized at program start:
static __m128i zero = _mm_setzero_si128();
(4) Would the global variable stay in an xmm register while the program runs?
Thanks a lot for your help!
(Since this also applies to AVX intrinsics, I also added the AVX tag.)
Answering the question that should really be asked here: you should not be worrying about this at all. Zeroing a register via xor effectively costs nothing at all most of the time. Modern x86 processors recognize this idiom and handle the zeroing directly in register rename; no µop needs to be sent to an execution unit. The only time this can slow you down is if you are bound by the front-end, but that is a rather rare situation to be in.
While variations on these questions might be worth pondering in other circumstances (and Mystical's comment gives some good leads on how to answer them yourself), you should really just use setzero and call it a day.
For this particular operation, you should do as Stephen Canon says and write
static inline Vec8s operator - (Vec8s const & a) {
return _mm_sub_epi16(_mm_setzero_si128(), a);
}
That's taken directly from Agner Fog's Vector Class Library.
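For anyone who wants to try this without the Vector Class Library, here is a minimal self-contained sketch using plain __m128i. The names negate_epi16 and negate_demo are made up for illustration and are not part of VCL; the only point is that the zero vector is created right where it is used.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Negate eight signed 16-bit lanes by computing 0 - a.
// The zero vector is created on the spot; no static or global is involved.
static inline __m128i negate_epi16(__m128i a) {
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}

// Hypothetical self-check, for illustration only.
static void negate_demo() {
    const int16_t in[8] = {0, 1, -1, 2, -2, 100, -100, 32767};
    int16_t out[8];
    // Unaligned load/store so the plain arrays need no special alignment.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), negate_epi16(v));
    for (int i = 0; i < 8; ++i)
        assert(out[i] == -in[i]);
}
```

Compiled with optimisation, one would expect the setzero call to turn into a single register-zeroing instruction, but that is up to the compiler.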
But let's consider what the static keyword does. When you declare a variable using static it has static storage duration. This places it in the data section (which includes the .bss section) of your object file.
#include <x86intrin.h>
extern "C" void foo2(__m128i a);
static const __m128i zero = _mm_setzero_si128();
static inline __m128i negate(__m128i a) {
return _mm_sub_epi16(zero, a);
}
extern "C" void foo(__m128i a, __m128i b) {
foo2(negate(a));
}
I compile with g++ -O3 -c static.cpp and then look at the disassembly and sections. I see there is a .bss section with a label _ZL4zero. Then there is a startup code section which writes the static variable in the .bss section.
.text.startup
pxor xmm0, xmm0
movaps XMMWORD PTR _ZL4zero[rip], xmm0
ret
The foo function then loads the value from memory:
movdqa xmm1, XMMWORD PTR _ZL4zero[rip]
psubw xmm1, xmm0
movdqa xmm0, xmm1
So GCC does not keep the static variable in an XMM register. It reads it from memory in the data section on each call.
What if we did _mm_sub_epi16(_mm_setzero_si128(), a)? Then GCC produces for foo
pxor xmm1, xmm1
psubw xmm1, xmm0
movdqa xmm0, xmm1
On Intel processors since Sandy Bridge the pxor is "free". On processors before that it's almost free. So this is clearly a better solution than reading from memory.
What if we tried _mm_sub_epi16(_mm_set1_epi32(-1), a)? In that case GCC produces
pcmpeqd xmm1, xmm1
psubw xmm1, xmm0
movdqa xmm0, xmm1
The pcmpeqd instruction is not free on any processor but it's still better than reading from memory using movdqa. Okay, so 0 and -1 are special. What about _mm_sub_epi16(_mm_set1_epi32(1), a)? In this case GCC produces for foo
movdqa xmm1, XMMWORD PTR .LC0[rip]
psubw xmm1, xmm0
movdqa xmm0, xmm1
That's essentially the same as using a static variable! When I look at the sections I see that .LC0 points to a read-only data section (.rodata).
Edit: here is a way to get GCC to keep a global variable in a register (this uses GCC's explicit register variable extension).
register __m128i zero asm ("xmm15") = _mm_set1_epi32(1);
This produces
movdqa xmm2, xmm15
psubw xmm2, xmm0
movdqa xmm0, xmm2
Since you use vectors for efficiency, your code has a problem.
A static local variable that isn't initialised with a constant will be initialised at runtime, in a thread-safe way. The first time your inline function is called, the static variable is initialised. On every single call after that, a check is made whether the static variable needs initialising or not.
So on every call, there is a check, then there is a load from memory. If you don't use a static variable, there's probably a single instruction creating the value, plus plenty of opportunity for optimisation. Loading from memory is slow.
And you can have as many static variables as you like. The compiler will handle anything you throw at it.
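To make the once-only initialisation visible, here is a small sketch that swaps the intrinsic for a real function with a counter, much like the experiment described in the question. Everything named here (init_calls, make_zero, negate_static, negate_plain, same_epi16, static_init_demo) is hypothetical and exists only for illustration. It shows that the initialiser runs once; it does not show the guard check itself, which you would have to look for in the generated assembly.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

static int init_calls = 0;  // hypothetical counter, for illustration only

// A real function standing in for the intrinsic so the initialisation is observable.
static __m128i make_zero() {
    ++init_calls;
    return _mm_setzero_si128();
}

// Static local: dynamically initialised the first time control reaches it.
static __m128i negate_static(__m128i a) {
    static __m128i zero = make_zero();
    return _mm_sub_epi16(zero, a);
}

// No static: the zero is created at the point of use.
static __m128i negate_plain(__m128i a) {
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}

// True if all eight 16-bit lanes of x and y are equal.
static bool same_epi16(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi16(x, y)) == 0xFFFF;
}

// Hypothetical self-check: both versions agree, and the initialiser ran once.
static void static_init_demo() {
    const __m128i v = _mm_set1_epi16(3);
    for (int i = 0; i < 3; ++i)
        assert(same_epi16(negate_static(v), negate_plain(v)));
    assert(init_calls == 1);
}
```

Both versions give the same result, so the static buys nothing in correctness; the difference is only in what the compiler has to emit for each call.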
I think I can add an interesting point to the discussion, particularly to my comment on _mm_abs_ps(). If I define
static inline __m128 _mm_abs_ps_2(__m128 x) {
__m128 signMask = _mm_set1_ps(-0.0F);
return _mm_andnot_ps(signMask, x);
}
(Agner Fog's VCL http://www.agner.org/optimize/#vectorclass uses an integer set1, a cast, and an AND operation instead, but that should in effect be the same) and use the function in a loop
float *p = data;
for (int i = 0; i < LEN; i += 4, p += 4)
_mm_store_ps(p, _mm_abs_ps_2(_mm_load_ps(p)));
then gcc (4.6.3, -O3) is clever enough to avoid repeatedly executing _mm_set1_ps by moving it outside the loop:
vmovaps xmm1, XMMWORD PTR .LC1[rip] # tmp108,
mov rax, rsp # p,
.L3:
vandnps xmm0, xmm1, XMMWORD PTR [rax] # tmp102, tmp108, MEM[base: p_54, offset: 0B]
vmovaps XMMWORD PTR [rax], xmm0 # MEM[base: p_54, offset: 0B], tmp102
add rax, 16 # p,
cmp rax, rbp # p, D.7371
jne .L3 #,
.LC1:
.long 2147483648
.long 2147483648
.long 2147483648
.long 2147483648
So, probably in most cases one shouldn't worry at all about repeatedly setting some xmm register to a constant inside some function.
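If you want to reproduce this, here is a self-contained sketch of the same function and loop. The names abs_ps, abs_in_place and abs_demo are made up (I avoided the _mm_ prefix so as not to clash with real intrinsics), and I use unaligned loads and a scalar tail so it works on any float array, which differs from the aligned loop above. Whether the constant gets hoisted out of the loop depends on your compiler and flags, so check the assembly yourself.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

// Absolute value of four floats: clear the sign bit of each lane.
static inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the sign bit is set
    return _mm_andnot_ps(sign_mask, x);           // (~sign_mask) & x
}

// Hypothetical helper: apply abs_ps to a whole array in place.
static void abs_in_place(float* data, int len) {
    int i = 0;
    for (; i + 4 <= len; i += 4)                  // full vectors
        _mm_storeu_ps(data + i, abs_ps(_mm_loadu_ps(data + i)));
    for (; i < len; ++i)                          // scalar tail
        data[i] = std::fabs(data[i]);
}

// Hypothetical self-check, for illustration only.
static void abs_demo() {
    float d[6] = {-1.5f, 2.0f, -0.0f, -3.25f, 4.0f, -7.0f};
    abs_in_place(d, 6);
    assert(d[0] == 1.5f && d[1] == 2.0f && d[3] == 3.25f && d[5] == 7.0f);
}
```

The set1 sits inside abs_ps and is written as if it were executed on every call; the compiler is free to load the constant once before the loop, as gcc did above.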