Profiling _mm_setzero_ps and {0.0f,0.0f,0.0f,0.0f}

Published 2020-03-30 03:15

Question:

EDIT: As Cody Gray pointed out in his comment, profiling with optimization disabled is a complete waste of time. How then should I approach this test?


Microsoft's XMVectorZero uses _mm_setzero_ps when _XM_SSE_INTRINSICS_ is defined, and {0.0f,0.0f,0.0f,0.0f} when it isn't. I decided to check how big the win is. So I used the following program in Release x86, with Configuration Properties > C/C++ > Optimization > Optimization set to Disabled (/Od).

#include <DirectXMath.h>   // XMVECTOR; also pulls in the SSE intrinsics
using namespace DirectX;

constexpr __int64 loops = 1000000000;
inline void fooSSE() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = _mm_setzero_ps();
        //XMVECTOR zero2 = _mm_setzero_ps();
        //XMVECTOR zero3 = _mm_setzero_ps();
        //XMVECTOR zero4 = _mm_setzero_ps();
    }
}
inline void fooNoIntrinsic() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero2 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero3 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero4 = { 0.f,0.f,0.f,0.f };
    }
}
int main() {
    fooNoIntrinsic();
    fooSSE();
}

I ran the program twice: first with only zero1, and the second time with all lines uncommented. In the first case the intrinsic loses; in the second, the intrinsic is the clear winner. So, my questions are:

  • Why does the intrinsic not always win?
  • Is the profiler I used a proper tool for such measurements?

Answer 1:

Profiling things with optimization disabled gives you meaningless results and is a complete waste of time. If you are disabling optimization because otherwise the optimizer notices that your benchmark actually does nothing useful and is removing it entirely, then welcome to the difficulties of microbenchmarking!

It is often very difficult to concoct a test case that does enough real work that it will not be removed by a sufficiently smart optimizer, yet whose cost does not overwhelm and render meaningless your results. For example, a lot of people's first instinct is to print out the incremental results using something like printf, but that's a non-starter: printf is incredibly slow and will absolutely ruin your benchmark.

Marking the variable that collects the intermediate values as volatile will sometimes work, because it effectively disables load/store optimizations for that particular variable. Although this relies on ill-defined semantics, that's not important for a benchmark. Another option is to perform some pointless yet relatively cheap operation on the intermediate results, like adding them together. This relies on the optimizer not outsmarting you, and in order to verify that your benchmark results are meaningful, you'll have to examine the object code emitted by the compiler and ensure that it is actually doing the work you expect. There is no magic bullet for crafting a microbenchmark, unfortunately.

The best trick is usually to isolate the relevant portion of the code inside of a function, parameterize it on one or more unpredictable input values, arrange for the result to be returned, and then put this function in an external module such that the optimizer can't get its grubby paws on it.

Since you'll need to look at the disassembly anyway to confirm that your microbenchmark case is suitable, this is often a good place to start. If you are sufficiently competent in reading assembly language, and you have sufficiently distilled the code in question, this may even be enough for you to make a judgment about the efficiency of the code. If you can't make heads or tails of the code, then it is probably sufficiently complicated that you can go ahead and benchmark it.

This is a good example of when a cursory examination of the generated object code is sufficient to answer the question without even needing to craft a benchmark.

Following my advice above, let's write a simple function to test out the intrinsic. In this case, we don't have any input to parameterize upon because the code literally just sets a register to 0. So let's just return the zeroed structure from the function:

DirectX::XMVECTOR ZeroTest_Intrinsic()
{
    return _mm_setzero_ps();
}

And here is the other candidate that performs the initialization the seemingly-naïve way:

DirectX::XMVECTOR ZeroTest_Naive()
{
    return { 0.0f, 0.0f, 0.0f, 0.0f };
}

Here is the object code generated by the compiler for these two functions (it doesn't matter which version, whether you compile for x86-32 or x86-64, or whether you optimize for size or speed; the results are the same):

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret
ZeroTest_Naive
    xorps  xmm0, xmm0
    ret

(If AVX or AVX2 instructions are supported, then these will both be vxorps xmm0, xmm0, xmm0.)

That is pretty obvious, even to someone who cannot read assembly code. They are both identical! I'd say that pretty definitively answers the question of which one will be faster: they will be identical because the optimizer recognizes the seemingly-naïve initializer and translates it into a single, optimized assembly-language instruction for clearing a register.

Now, it is certainly possible that there are cases where this is embedded deep within various complicated code constructs, preventing the optimizer from recognizing it and performing its magic. In other words, the "your test function is too simple!" objection. And that is most likely why the library's implementer chose to explicitly use the intrinsic whenever it is available. Its use guarantees that the code-gen will emit the desired instruction, and therefore the code will be as optimized as possible.

Another possible benefit of explicitly using the intrinsic is to ensure that you get the desired instruction, even if the code is being compiled without SSE/SSE2 support. This isn't a particularly compelling use-case, as I imagine it, because you wouldn't be compiling without SSE/SSE2 support if it was acceptable to be using these instructions. And if you were explicitly trying to disable the generation of SSE/SSE2 instructions so that you could run on legacy systems, the intrinsic would ruin your day because it would force an xorps instruction to be emitted, and the legacy system would throw an invalid operation exception immediately upon hitting this instruction.

I did see one interesting case, though. xorps is the single-precision version of this instruction, and requires only SSE support. However, if I compile the functions shown above with only SSE support (no SSE2), I get the following:

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret
ZeroTest_Naive
    push   ebp
    mov    ebp, esp
    and    esp, -16
    sub    esp, 16

    mov    DWORD PTR [esp],    0
    mov    DWORD PTR [esp+4],  0
    mov    DWORD PTR [esp+8],  0
    mov    DWORD PTR [esp+12], 0
    movaps xmm0, XMMWORD PTR [esp]

    mov    esp, ebp
    pop    ebp
    ret

Clearly, for some reason, the optimizer is unable to apply the optimization to the use of the initializer when SSE2 instruction support is not available, even though the xorps instruction that it would be using does not require SSE2 instruction support! This is arguably a bug in the optimizer, but explicit use of the intrinsic works around it.