Fastest way to compute distance squared

2020-08-10 08:00发布

问题:

My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout. But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster. I now have:

double distance_squared(double *a, double *b)
{
  double dx = a[0] - b[0];
  double dy = a[1] - b[1];
  double dz = a[2] - b[2];

  return dx*dx + dy*dy + dz*dz;
}

I also tried using a macro to avoid the function call but it doesn't help much.

#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])

I thought about using SIMD instructions but could not find a good example or complete list of instructions (ideally some multiply+add on two vectors).

GPU's are not an option since only one set of points is known at each function call.

What would be the fastest way to compute the distance squared?

回答1:

A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.

I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.



回答2:

The first obvious thing would be to use the restrict keyword.

As it is now, a and b are aliasable (and thus, from the compiler's point of view which assumes the worst possible case are aliased). No compiler will auto-vectorize this, as it is wrong to do so.

Worse, not only can the compiler not vectorize such a loop, in case you also store (luckily not in your example), it also must re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.

Next, if you can live with that, use float instead of double and pad to 4 floats even if one is unused, this is a more "natural" data layout for the majority of CPUs (this is somewhat platform specific, but 4 floats is a good guess for most platforms -- 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere).

(For a hand-written SIMD implementation (which is harder than you think), first and before all be sure to have aligned data. Next, look into what latencies your instrucitons have on the target machine and do the longest ones first. For example on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On the later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)



回答3:

The SIMD code to do this (using SSE3):

movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0

but you need four value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format which would cancel out any benefit of the above.

In general though, due to the superscalar pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data, that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definately stall on the last three instructions based on the "can't use a value directly after it's modified" principle - the second instruction has to wait for the result of the previous instruction to complete which isn't good in a pipelined system.

Doing the calculation on two or more different sets points of points at the same time can remove the above bottleneck - whilst waiting for the result of one computation, you can start the computation of the next point:

movaps xmm0,a1
                  movaps xmm2,a2
movaps xmm1,b1
                  movaps xmm3,b2
subps xmm0,xmm1
                  subps xmm2,xmm3
mulps xmm0,xmm0
                  mulps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2


回答4:

If you would like to optimize something, at first profile code and inspect assembler output.

After compiling it with gcc -O3 (4.6.1) we'll have nice disassembled output with SIMD:

movsd   (%rdi), %xmm0
movsd   8(%rdi), %xmm2
subsd   (%rsi), %xmm0
movsd   16(%rdi), %xmm1
subsd   8(%rsi), %xmm2
subsd   16(%rsi), %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0


回答5:

This type of problem often occurs in MD simulations. Usually the amount of calculations is reduced by cutoffs and neighbor lists, so the number for the calculation is reduced. The actual calculation of the squared distances however is exactly done (with compiler optimizations and a fixed type float[3]) as given in your question.

So if you want to reduce the amount of squared calculations you should tell us more about the problem.



回答6:

Perhaps passing the 6 doubles directly as arguments could make it faster (because it could avoid the array dereference):

inline double distsquare_coord(double xa, double ya, double za, 
                               double xb, double yb, double zb) 
{ 
  double dx = xa-yb; double dy=ya-yb; double dz=za-zb;
  return dx*dx + dy*dy + dz*dz; 
}

Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.



回答7:

If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only)

// @brief Computes two squared distances between two pairs of 3D vectors
// @param a
//  Pointer to the first pair of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param b
//  Pointer to the second pairs of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param c
//  Pointer to the output 2 element array.
//  Must be aligned by 16 (2 doubles).
//  The two distances between a and b vectors will be written to c[0] and c[1] respectively.
void (const double * __restrict__ a, const double * __restrict__ b, double * __restrict c) {
    // diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
    __m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
    // diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
    __m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
    // diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
    __m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));

    // prod0 = ( d0.y * d0.y, d0.x * d0.x )
    __m128d prod0 = _mm_mul_pd(diff0, diff0);
    // prod1 = ( d1.x * d1.x, d0.z * d0.z )
    __m128d prod1 = _mm_mul_pd(diff1, diff1);
    // prod2 = ( d1.z * d1.z, d1.y * d1.y )
    __m128d prod2 = _mm_mul_pd(diff1, diff1);

    // _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
    // psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
    __m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);

    // _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
    // dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
    __m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);

    __mm_store_pd(c, dotprod);
}