Should I expect GMP's mpf_class to be much slo

2019-07-18 02:43发布

I am writing a C++ program to generate Mandelbrot set zooms. All of my complex numbers were originally two doubles (one for the real part, one for the complex part). This was working pretty fast; 15 seconds per frame for the type of image I was generating.

Because of the zooming effect, I wanted to increase precision for the more zoomed-in frames, since these frames have such a small difference between the min_x and max_x. I looked to GMP to help me out with this.

Now, it is much much slower; 15:38 minutes per frame. The settings for the image are the same as before and the algorithm is the same. The only thing that has changed is that I am using mpf_class for the decimals which need to be precise (i.e. just the complex numbers). To compare performance, I used the same precision as double: mpf_set_default_prec(64);

Does GMP change the precision of mpf_class to meet the needs of the expression? In other words, if I have two 64 bit mpf_class objects and I do a calculation with them and store the result in another mpf_class, is the precision potentially increased? This would ruin performance over time I would think, but I am not sure that this is what is causing my issue.

My Quesions: Is this performance drop just the nature of GMP and other arbitrary precision libraries? What advice would you give?

Edit 1
I am (i.e. have always been) using -O3 flag for optimizing. I have also ran a test to verify that GMP is not automatically increasing the precision of the mpf_class objects. So the question still remains as to the reason for the drastic performance decrease.

Edit 2
As a demonstrative example, I compiled the following code as g++ main.cpp -lgmp -lgmpxx once as shown below, and once with every double replaced with mpf_class. With double it ran in 12.75 seconds and with mpf_class it ran in 24:54 minutes. Why is this the case when they are the same precision?

#include <gmpxx.h>

double linear_map(double d, double a1, double b1, double a2, double b2) {
  double a = (d-a1)/(b1-a1);
  return (a*(b2-a2)) + (a2);
}

int iterate(double x0, double y0) {
  double x, y;
  x = 0;
  y = 0;
  int i;
  for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
  }
  return i;
}

int main() {
  mpf_set_default_prec(64);
  for (int j = 0; j < 3200; j++) {
    for (int i = 0; i < 3200; i++) {
      double x = linear_map(i, 0, 3200, -2, 1);
      double y = linear_map(j, 0, 3200, -1.5, 1.5);
      iterate(x, y);
    }
  }
  return 0;
}

1条回答
够拽才男人
2楼-- · 2019-07-18 03:37

As explained in the comments, this kind of slowdown is entirely expected from a library such as GMP.

Builtin double multiplications are one of the areas where current-day CPUs and compilers are most optimized; CPUs have multiple execution units that manage to execute in parallel multiple floating point operations, often aided by the compilers, which try to auto-vectorize loops (although this isn't particularly applicable to your case, as your innermost loop has a strong dependency from the previous iteration).

On the other hand, in multiple, dynamic precision libraries such as GMP each operation amounts to lots of work - there are multiple branches to examine even just to check if both operands have the same/right amount of precision, and calculation algorithms implemented are generic and tailored towards the "higher precision" end, which means that they aren't particularly optimized for your current use case (using them with the same precision as double); also, GMP values can and do allocate memory when they are created, another costly operation.

I took your program and modified it slightly to make it parametric over the type to use (with a #define), reducing the side of the sampled square (from 3200 to 800, to make tests faster) and adding an accumulator of the return value of iterate to print it at the end, both to check if everything is working the same way between the various versions, and to make sure that the optimizer doesn't drop the loop completely.

The double version on my machine takes roughly 0.16 seconds, and, ran into a profiler, exhibits a completely flat profile in the flamegraph; everything happens in iterate.

flamegraph of double-based version

The GMP version instead, as expected, takes 45 seconds (300x; you talked about a 60x slowdown, but you were comparing with an unoptimized base case) and is way more varied:

flamegraph of GMP-based version

As before, iterate is taking pretty much the whole time (so we can ignore completely linear_map as far as optimization is concerned). All those "towers" are calls into GMP code; the __gmp_expr<...> stuff isn't particularly relevant - they are just template boilerplate to make complex expressions evaluate without too many temporaries, and gets completely inlined. The bulk of time is spent on the top of those towers, where the actual calculations are performed.

Indeed, ultimately most of the time is spent in GMP primitives and in memory allocation:

caller/callee times for GMP-based version

Given that we cannot touch the GMP internals, the only thing we can do is to be more careful using it, as each GMP operation is indeed costly.

Indeed it's important to keep in mind that, while the compiler can avoid calculating multiple times the same expressions for double values, it cannot do the same for GMP values, as they both have side effects (memory allocation, external function calls) and are too complicated to be examined by it anyway. In your inner loop we have:

  double x, y;
  x = 0;
  y = 0;
  int i;
    for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
      T xtemp = x*x - y*y + x0;

(T is the generic type I'm using, defined to double or mpf_class)

Here you are calculating x*x and y*y twice at each iteration. We can optimize it as:

T x = 0, y = 0, xsq, ysq;
for(i = 0; i < 1000; i++) {
  xsq = x*x;
  ysq = y*y;
  if(xsq+ysq > 65536) break;
  T xtemp = xsq - ysq + y0;

Re-running the GMP version with this modification we get down to 38 seconds, which is a 18% improvement.

Notice that we kept xsq and ysq out of the loop to avoid re-creating them at each iteration; this is because, unlike double values (which ultimately are just register space or, at worse, stack space, both of which are free and handled statically by the compiler), mpt_class objects aren't free to re-create every time, as was hinted by the prominence of memory allocation functions in the profiler trace above; I'm not entirely aware of the inner workings of the GMP C++ wrapper, but I suspect that it enjoys optimizations similar to std::vector - on assignment an already allocated value will be able to recycle its space on allocation instead of allocating again.

Hence, we can hoist even the xtemp definition out of the loop

int iterate(T x0, T y0) {
  T x = 0, y = 0, xsq , ysq, xtemp;
  int i;
  for (i = 0; i < 1000; i++) {
    xsq = x*x;
    ysq = y*y;
    if(xsq+ysq > 65536) break;
    xtemp = xsq - ysq + y0;
    y = 2*x*y + y0;
    x = xtemp;
  }
  return i;
}

which brings the runtime down to 33 seconds, which is 27% less than the original time.

flamegraph of GMP-based optimized version

The flamegraph is similar to the one before, but it seems more compact - we have shaved off some of the "interstitial" wastes of times, leaving just the core of the calculations. Most importantly, looking at the top hotspots, we can indeed see that multiplication switched place with subtraction, and malloc/free lost several positions.

caller/callee times of GMP-based optimized version


I don't think that this can be optimized much more from a purely black-box perspective. If those are the calculations to do, I fear there's no easy way to perform them faster using GMP's mpf_class. At this point you should either:

  • drop GMP and use some other library, with better performance for the fixed-size case; I suspect there is something to gain here - even just avoiding completely allocations and inlining the calculation is a big win;
  • start applying some algorithmic optimizations; these will provide significantly shorter run time whatever data type you'll ultimately decide to use.

Notes

  • the full code (in its various iterations) can be found at https://bitbucket.org/mitalia/mandelbrot_gmp_test/
  • all tests done with g++ 7.3 with optimization level -O3 on 64 bit Linux running on my i7-6700
  • profiling performed using perf record from linux-tools 4.15 with call stack capture; graphs & tables generated by KDAB Hotspot 1.1.0
查看更多
登录 后发表回答