
The program runs 3 times slower when compiled with g++ 5.3.1 than with g++ 4.8.4


Question:

Recently I've started to use Ubuntu 16.04 with g++ 5.3.1 and noticed that my program runs 3 times slower. Before that I used Ubuntu 14.04 with g++ 4.8.4. I built it with the same commands: CFLAGS = -std=c++11 -Wall -O3.

My program contains loops filled with math calls (sin, cos, exp). You can find it here.

I've tried compiling with different optimization flags (O0, O1, O2, O3, Ofast), but the problem is reproduced in all cases (with Ofast both variants run faster, but the first still runs 3 times slower).

In my program I use libtinyxml-dev and libgslcblas, but they have the same versions in both cases and, according to the code and callgrind profiling, they don't account for any significant part of the run time.

I've profiled the program, but the results don't give me any idea why this happens. Kcachegrind comparison (left is the slower run). The only thing I've noticed is that the program now uses libm-2.23, compared to libm-2.19 on Ubuntu 14.04.

My processor is an i7-5820 (Haswell).

I have no idea why it becomes slower. Do you have any ideas?

P.S. Below you can find the most time-consuming function:

void InclinedSum::prepare3D()
{
double buf1, buf2;
double sum_prev1 = 0.0, sum_prev2 = 0.0;
int break_idx1, break_idx2; 
int arr_idx;

for(int seg_idx = 0; seg_idx < props->K; seg_idx++)
{
    const Point& r = well->segs[seg_idx].r_bhp;

    for(int k = 0; k < props->K; k++)
    {
        arr_idx = seg_idx * props->K + k;
        F[arr_idx] = 0.0;

        break_idx2 = 0;

        for(int m = 1; m <= props->M; m++)
        {
            break_idx1 = 0;

            for(int l = 1; l <= props->L; l++)
            {
                buf1 = ((cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                            cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                        ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x + M_PI * (double)(l) / props->sizes.z ) + 
                            (cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                            cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                        ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x - M_PI * (double)(l) / props->sizes.z )
                            ) / 2.0;

                buf2 = sqrt((double)(m) * (double)(m) / props->sizes.x / props->sizes.x + (double)(l) * (double)(l) / props->sizes.z / props->sizes.z);

                for(int i = -props->I; i <= props->I; i++)
                {   

                    F[arr_idx] += buf1 / well->segs[k].length / buf2 *
                        ( exp(-M_PI * buf2 * fabs(r.y - props->r1.y + 2.0 * (double)(i) * props->sizes.y)) - 
                        exp(-M_PI * buf2 * fabs(r.y + props->r1.y + 2.0 * (double)(i) * props->sizes.y)) ) *
                        sin(M_PI * (double)(m) * r.x / props->sizes.x) * 
                        cos(M_PI * (double)(l) * r.z / props->sizes.z);
                }

                if( fabs(F[arr_idx] - sum_prev1) > F[arr_idx] * EQUALITY_TOLERANCE )
                {
                    sum_prev1 = F[arr_idx];
                    break_idx1 = 0;
                } else
                    break_idx1++;

                if(break_idx1 > 1)
                {
                    //std::cout << "l=" << l << std::endl;
                    break;
                }
            }

            if( fabs(F[arr_idx] - sum_prev2) > F[arr_idx] * EQUALITY_TOLERANCE )
            {
                sum_prev2 = F[arr_idx];
                break_idx2 = 0;
            } else
                break_idx2++;

            if(break_idx2 > 1)
            {
                std::cout << "m=" << m << std::endl;
                break;
            }
        }
    }
}
}

Further investigation: I wrote the following simple program:

#include <cmath>
#include <iostream>
#include <chrono>

#define CYCLE_NUM 1E+7

using namespace std;
using namespace std::chrono;

int main()
{
    double sum = 0.0;

    auto t1 = high_resolution_clock::now();
    for(int i = 1; i < CYCLE_NUM; i++)
    {
        sum += sin((double)(i)) / (double)(i);
    }
    auto t2 = high_resolution_clock::now();

    microseconds::rep t = duration_cast<microseconds>(t2-t1).count();

    cout << "sum = " << sum << endl;
    cout << "time = " << (double)(t) / 1.E+6 << endl;

    return 0;
}

I am really wondering why this simple sample program runs 2.5 times faster under g++ 4.8.4 with libc-2.19 (libm-2.19) than under g++ 5.3.1 with libc-2.23 (libm-2.23).

The compile command was:

g++ -std=c++11 -O3 main.cpp -o sum

Using other optimization flags doesn't change the ratio.

How can I work out which of the two, gcc or libc, slows the program down?
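
One check I can think of (assuming g++-4.8 is still installable next to the default g++-5 on Ubuntu 16.04) is to build the same benchmark with both compilers on one machine, so that both binaries run against the same libm:

g++-4.8 -std=c++11 -O3 main.cpp -o sum_gcc48
g++-5 -std=c++11 -O3 main.cpp -o sum_gcc5
./sum_gcc48
./sum_gcc5

If both binaries are equally slow there, the compiler is probably not the culprit and the suspicion falls on libc/libm.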

Answer 1:

This is a bug in glibc that affects version 2.23 (used in Ubuntu 16.04) and early versions of 2.24 (Fedora and Debian, for example, already include patched versions that are no longer affected; Ubuntu 16.10 and 17.04 do not yet).

The slowdown stems from the SSE to AVX register transition penalty. See the glibc bug report here: https://sourceware.org/bugzilla/show_bug.cgi?id=20495

Oleg Strikov wrote up a quite extensive analysis in his Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280

Without the patch, there are several possible workarounds: you can link your program statically (i.e. add -static), or you can disable lazy binding by setting the environment variable LD_BIND_NOW when running the program. Both avoid the dynamic linker's lazy-binding path, which is where the problematic register state comes from. Again, more details are in the above bug reports.
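
Applied to the small benchmark from the question, the two workarounds look roughly like this (the exact command lines are my own illustration, not taken from the bug reports):

# link statically: no dynamic libm, hence no lazy-binding trampoline
g++ -std=c++11 -O3 -static main.cpp -o sum_static
./sum_static

# or keep the dynamic binary, but resolve all symbols eagerly at startup
g++ -std=c++11 -O3 main.cpp -o sum
LD_BIND_NOW=1 ./sum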



Answer 2:

For a really precise answer you'll probably need a libm maintainer to look at your question. However, here is my take; treat it as a draft, and if I find something else I'll add it to this answer.

First, let's look at the asm generated by GCC, comparing gcc 4.8.2 and gcc 5.3. There are only 4 differences:

  • at the beginning a xorpd gets transformed into a pxor, for the same registers
  • a pxor xmm1, xmm1 was added before the conversion from int to double (cvtsi2sd)
  • a movsd was moved just before the conversion
  • the addition (addsd) was moved just before a comparison (ucomisd)

All of this is probably not enough to explain the decrease in performance. A fine-grained profiler (Intel's, for example) would allow a more conclusive analysis, but I don't have access to one.
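
For reference, the asm comparison itself is easy to reproduce (these command lines are mine and assume both compilers are installed side by side): emit the assembly with each compiler and diff the two files.

g++-4.8 -std=c++11 -O3 -S main.cpp -o main_gcc48.s
g++-5 -std=c++11 -O3 -S main.cpp -o main_gcc5.s
diff main_gcc48.s main_gcc5.s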

Now, there is a dependency on sin, so let's see what changed there. The first problem is identifying which platform you use... there are 17 different subfolders in glibc's sysdeps (where sin is defined), so I went for the x86_64 one.

First, the way processor capabilities are handled changed. For example, glibc/sysdeps/x86_64/fpu/multiarch/s_sin.c used to do the checking for FMA / AVX itself in 2.19, but in 2.23 it is done externally. There could be a bug in which the capabilities are not properly reported, resulting in neither FMA nor AVX being used. However, I don't find this hypothesis very plausible.
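
To at least rule out the hardware side of that hypothesis, here is a tiny program (my own sketch, querying CPUID directly through GCC's <cpuid.h>; it tells you what the processor advertises, not what glibc's detection code concluded) that confirms whether AVX and FMA are reported:

#include <cpuid.h>
#include <iostream>

int main()
{
    unsigned int eax, ebx, ecx, edx;

    // CPUID leaf 1: the feature flags we care about are returned in ECX.
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        std::cout << "CPUID leaf 1 not available" << std::endl;
        return 1;
    }

    std::cout << "AVX: " << ((ecx & bit_AVX) != 0) << std::endl;
    std::cout << "FMA: " << ((ecx & bit_FMA) != 0) << std::endl;

    return 0;
}

On an i7-5820 both lines should print 1; if they do, the capabilities themselves are there and any problem would lie in how glibc uses them.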

Secondly, in .../x86_64/fpu/s_sinf.S, the only modifications (apart from a copyright update) change the stack offset, aligning it to 16 bytes; likewise for sincos. I'm not sure that would make a huge difference.

However, 2.23 added a lot of sources for vectorized versions of the math functions, and some use AVX512, which your processor probably doesn't support because it is really new. Maybe libm tries to use those extensions and, since you don't have them, falls back on a generic version?

EDIT: I tried compiling it with gcc 4.8.5, but for that I need to recompile glibc 2.19. For the moment I cannot link, because of this:

/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libm.a(s_sin.o): in function « __cos »:
(.text+0x3542): undefined reference to « _dl_x86_cpu_features »
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libm.a(s_sin.o): in function « __sin »:
(.text+0x3572): undefined reference to « _dl_x86_cpu_features »

I will try to resolve this, but note beforehand that it is very probable that this symbol is responsible for choosing the right optimized version based on the processor, which may be part of the performance hit.