I am writing a C++ program to generate Mandelbrot set zooms. All of my complex numbers were originally two doubles (one for the real part, one for the imaginary part). This was working pretty fast: 15 seconds per frame for the type of image I was generating.
Because of the zooming effect, I wanted to increase precision for the more zoomed-in frames, since these frames have such a small difference between min_x and max_x. I looked to GMP to help me out with this.
Now it is much, much slower: 15:38 minutes per frame. The settings for the image are the same as before and the algorithm is the same. The only thing that has changed is that I am using mpf_class for the decimals which need to be precise (i.e. just the complex numbers). To compare performance, I used the same precision as double: mpf_set_default_prec(64);

Does GMP change the precision of mpf_class objects to meet the needs of the expression? In other words, if I have two 64-bit mpf_class objects, do a calculation with them, and store the result in another mpf_class, is the precision of the result potentially increased? I would think that would ruin performance over time, but I am not sure this is what is causing my issue.

My Questions: Is this performance drop just the nature of GMP and other arbitrary-precision libraries? What advice would you give?
Edit 1
I am (i.e. have always been) using the -O3 flag for optimization. I have also run a test to verify that GMP is not automatically increasing the precision of the mpf_class objects; a sketch of that kind of check is below. So the question still remains as to the reason for the drastic performance decrease.
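The check was along these lines (a sketch of the idea, not my exact test code):
#include <iostream>
#include <gmpxx.h>

int main() {
    mpf_set_default_prec(64);
    mpf_class a = 1.0, b = 3.0, c;
    c = a / b;  // the result is rounded to fit c's own precision
    // All three objects report the same precision (GMP may round the
    // requested 64 bits up to a whole number of limbs, but the precision
    // never grows as a result of arithmetic).
    std::cout << a.get_prec() << " " << b.get_prec() << " "
              << c.get_prec() << std::endl;
    return 0;
}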
Edit 2
As a demonstrative example, I compiled the following code with g++ main.cpp -lgmp -lgmpxx, once as shown below and once with every double replaced by mpf_class (a sketch of that variant's iterate follows the listing). With double it ran in 12.75 seconds; with mpf_class it ran in 24:54 minutes. Why is this the case when they have the same precision?
#include <gmpxx.h>

double linear_map(double d, double a1, double b1, double a2, double b2) {
    double a = (d-a1)/(b1-a1);
    return (a*(b2-a2)) + (a2);
}

int iterate(double x0, double y0) {
    double x, y;
    x = 0;
    y = 0;
    int i;
    for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
    }
    return i;
}

int main() {
    mpf_set_default_prec(64);
    for (int j = 0; j < 3200; j++) {
        for (int i = 0; i < 3200; i++) {
            double x = linear_map(i, 0, 3200, -2, 1);
            double y = linear_map(j, 0, 3200, -1.5, 1.5);
            iterate(x, y);
        }
    }
    return 0;
}
As explained in the comments, this kind of slowdown is entirely expected from a library such as GMP.
Built-in double multiplications are one of the areas where current-day CPUs and compilers are most heavily optimized; CPUs have multiple execution units that manage to execute several floating-point operations in parallel, often aided by the compilers, which try to auto-vectorize loops (although this isn't particularly applicable to your case, as your innermost loop has a strong dependency on the previous iteration). On the other hand, in multiple-/dynamic-precision libraries such as GMP, each operation amounts to a lot of work: there are multiple branches to examine even just to check whether both operands have the same/right amount of precision, and the calculation algorithms are generic and tailored towards the "higher precision" end, which means they aren't particularly optimized for your current use case (using them with the same precision as double); also, GMP values can and do allocate memory when they are created, another costly operation.

I took your program and modified it slightly to make it parametric over the type to use (with a #define), reducing the side of the sampled square (from 3200 to 800, to make tests faster) and adding an accumulator of the return value of iterate, printed at the end, both to check that everything works the same way between the various versions and to make sure the optimizer doesn't drop the loop completely.
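In sketch form it looks like this (modulo details such as variable names, the accumulator type, and the output format):
#include <cstdio>
#include <gmpxx.h>

// Flip this one line to switch between the two versions under test.
// (Names below are illustrative.)
#define T mpf_class   // or: #define T double

T linear_map(T d, T a1, T b1, T a2, T b2) {
    T a = (d-a1)/(b1-a1);
    return (a*(b2-a2)) + (a2);
}

int iterate(T x0, T y0) {
    T x = 0, y = 0;
    int i;
    for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
        T xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
    }
    return i;
}

int main() {
    mpf_set_default_prec(64);
    const int side = 800;   // reduced from 3200 to keep test runs short
    long acc = 0;           // accumulate the iteration counts...
    for (int j = 0; j < side; j++) {
        for (int i = 0; i < side; i++) {
            T x = linear_map(i, 0, side, -2, 1);
            T y = linear_map(j, 0, side, -1.5, 1.5);
            acc += iterate(x, y);
        }
    }
    // ...and print them, both to check that the versions agree and to
    // keep the optimizer from dropping the loops entirely.
    printf("%ld\n", acc);
    return 0;
}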
The double version on my machine takes roughly 0.16 seconds and, run under a profiler, exhibits a completely flat profile in the flamegraph; everything happens in iterate. The GMP version instead, as expected, takes 45 seconds (300x; you talked about a 60x slowdown, but you were comparing against an unoptimized base case) and its flamegraph is way more varied.
As before, iterate is taking pretty much the whole time (so we can completely ignore linear_map as far as optimization is concerned). All those "towers" are calls into GMP code; the __gmp_expr<...> stuff isn't particularly relevant: it's just template boilerplate that makes complex expressions evaluate without too many temporaries, and it gets completely inlined. The bulk of the time is spent at the top of those towers, where the actual calculations are performed. Indeed, ultimately most of the time is spent in GMP primitives and in memory allocation.
Given that we cannot touch the GMP internals, the only thing we can do is to be more careful using it, as each GMP operation is indeed costly.
Indeed, it's important to keep in mind that, while the compiler can avoid calculating the same expression multiple times for double values, it cannot do the same for GMP values, as they both have side effects (memory allocation, external function calls) and are too complicated to be examined by it anyway. In your inner loop (where T is the generic type I'm using, defined to either double or mpf_class), you are calculating x*x and y*y twice at each iteration. We can compute each square once per iteration instead, reusing it both in the loop condition and in the body, as in the sketch below. Re-running the GMP version with this modification, we get down to 38 seconds, an 18% improvement.
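A minimal sketch of that change, keeping the rest of iterate as in the question:
int iterate(T x0, T y0) {
    T x = 0, y = 0;
    // xsq and ysq hold x*x and y*y, computed once per iteration and
    // reused both in the loop condition and in the body.
    T xsq = x*x, ysq = y*y;
    int i;
    for (i = 0; i < 1000 && xsq + ysq <= 65536; i++) {
        T xtemp = xsq - ysq + x0;
        y = 2*x*y + y0;
        x = xtemp;
        xsq = x*x;
        ysq = y*y;
    }
    return i;
}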
Notice that we kept xsq and ysq out of the loop to avoid re-creating them at each iteration; this is because, unlike double values (which ultimately are just register space or, at worst, stack space, both of which are free and handled statically by the compiler), mpf_class objects aren't free to re-create every time, as was hinted by the prominence of memory-allocation functions in the profiler trace above. I'm not entirely aware of the inner workings of the GMP C++ wrapper, but I suspect that it enjoys an optimization similar to std::vector: on assignment, an already-allocated value can recycle its space instead of allocating again.

Hence, we can hoist even the xtemp definition out of the loop, as in the sketch below, which brings the runtime down to 33 seconds, 27% less than the original time.
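In sketch form, with every GMP object now created once before the loop:
int iterate(T x0, T y0) {
    T x = 0, y = 0;
    // All temporaries are created once, up front, so their storage can be
    // recycled on assignment instead of being reallocated every iteration.
    T xsq = x*x, ysq = y*y, xtemp;
    int i;
    for (i = 0; i < 1000 && xsq + ysq <= 65536; i++) {
        xtemp = xsq - ysq + x0;
        y = 2*x*y + y0;
        x = xtemp;
        xsq = x*x;
        ysq = y*y;
    }
    return i;
}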
The flamegraph is similar to the one before, but it seems more compact: we have shaved off some of the "interstitial" waste of time, leaving just the core of the calculations. Most importantly, looking at the top hotspots, we can indeed see that multiplication switched places with subtraction, and malloc/free lost several positions.

I don't think this can be optimized much further from a purely black-box perspective. If those are the calculations to do, I fear there's no easy way to perform them faster using GMP's mpf_class. At this point you should either:

Notes
perf record from linux-tools 4.15 with call-stack capture; graphs & tables generated by KDAB Hotspot 1.1.0.