Laderman's 3x3 matrix multiplication with only 23 multiplications

Published 2019-02-05 21:37

Take the product of two 3x3 matrices, A*B = C. Naively this requires 27 multiplications using the standard algorithm. With some cleverness, it can be done using only 23 multiplications, a result found by Laderman in 1976. The technique involves saving intermediate products and combining them in the right way.

Now let's fix a language and a type, say C++ with elements of type double. If the Laderman algorithm were hard-coded versus the simple triple loop, could we expect a modern compiler's optimizations to even out the difference between the algorithms?

Notes about this question: This is a programming site, and the question is asked in the context of best practice for a time-critical inner loop; premature optimization this is not. Tips on implementation are greatly welcomed as comments.

4 answers
做自己的国王 · 2019-02-05 22:08

I expect the main performance concern would be memory latency. A double[9] is typically 72 bytes. That's already a nontrivial amount, and you're using three of them.
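
As a quick sanity check on those sizes, here is a minimal sketch (my addition, assuming the typical 8-byte double):

#include <cstdio>

int main() {
  double a[3][3];
  // With 8-byte doubles, one 3x3 matrix is 72 bytes, so the three
  // matrices in C = A*B together occupy 216 bytes.
  std::printf("one matrix: %zu bytes, three: %zu bytes\n",
              sizeof(a), 3 * sizeof(a));
  return 0;
}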

来,给爷笑一个 · 2019-02-05 22:12

Timing Tests:

I ran the timing tests myself, and the results surprised me (hence the question in the first place). The short of it: under a standard compile the Laderman version is ~225% faster, but with the -O3 optimization flag it is ~50% slower! I had to write a random element into the matrix on each iteration when compiling with -O3, or the compiler optimized the simple multiplication away completely, taking zero time within clock precision. Since the Laderman algorithm was a pain to check and double-check, I'll post the complete code below for posterity.

Specs: Ubuntu 12.04, Dell Precision T1600, gcc. Percent difference in timings, i.e. (t_simple - t_laderman)/t_laderman:

  • g++ [2.22, 2.23, 2.27]
  • g++ -O3 [-0.48, -0.49, -0.48]
  • g++ -funroll-loops -O3 [-0.48, -0.48, -0.47] 

Benchmarking code along with the Laderman implementation:

#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;

void simple_mul(const double a[3][3],
                const double b[3][3],
                double c[3][3]) {
  // Textbook algorithm: 27 multiplications, 18 additions.
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
      c[i][j] = 0;
      for (int m = 0; m < 3; m++)
        c[i][j] += a[i][m] * b[m][j];
    }
  }
}

void laderman_mul(const double a[3][3],
                  const double b[3][3],
                  double c[3][3]) {

   double m[24]; // m[0] is unused; indices 1..23 match Laderman's paper

   m[1 ]= (a[0][0]+a[0][1]+a[0][2]-a[1][0]-a[1][1]-a[2][1]-a[2][2])*b[1][1];
   m[2 ]= (a[0][0]-a[1][0])*(-b[0][1]+b[1][1]);
   m[3 ]= a[1][1]*(-b[0][0]+b[0][1]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][2]);
   m[4 ]= (-a[0][0]+a[1][0]+a[1][1])*(b[0][0]-b[0][1]+b[1][1]);
   m[5 ]= (a[1][0]+a[1][1])*(-b[0][0]+b[0][1]);
   m[6 ]= a[0][0]*b[0][0];
   m[7 ]= (-a[0][0]+a[2][0]+a[2][1])*(b[0][0]-b[0][2]+b[1][2]);
   m[8 ]= (-a[0][0]+a[2][0])*(b[0][2]-b[1][2]);
   m[9 ]= (a[2][0]+a[2][1])*(-b[0][0]+b[0][2]);
   m[10]= (a[0][0]+a[0][1]+a[0][2]-a[1][1]-a[1][2]-a[2][0]-a[2][1])*b[1][2];
   m[11]= a[2][1]*(-b[0][0]+b[0][2]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][1]);
   m[12]= (-a[0][2]+a[2][1]+a[2][2])*(b[1][1]+b[2][0]-b[2][1]);
   m[13]= (a[0][2]-a[2][2])*(b[1][1]-b[2][1]);
   m[14]= a[0][2]*b[2][0];
   m[15]= (a[2][1]+a[2][2])*(-b[2][0]+b[2][1]);
   m[16]= (-a[0][2]+a[1][1]+a[1][2])*(b[1][2]+b[2][0]-b[2][2]);
   m[17]= (a[0][2]-a[1][2])*(b[1][2]-b[2][2]);
   m[18]= (a[1][1]+a[1][2])*(-b[2][0]+b[2][2]);
   m[19]= a[0][1]*b[1][0];
   m[20]= a[1][2]*b[2][1];
   m[21]= a[1][0]*b[0][2];
   m[22]= a[2][0]*b[0][1];
   m[23]= a[2][2]*b[2][2];

  c[0][0] = m[6]+m[14]+m[19];
  c[0][1] = m[1]+m[4]+m[5]+m[6]+m[12]+m[14]+m[15];
  c[0][2] = m[6]+m[7]+m[9]+m[10]+m[14]+m[16]+m[18];
  c[1][0] = m[2]+m[3]+m[4]+m[6]+m[14]+m[16]+m[17];
  c[1][1] = m[2]+m[4]+m[5]+m[6]+m[20];
  c[1][2] = m[14]+m[16]+m[17]+m[18]+m[21];
  c[2][0] = m[6]+m[7]+m[8]+m[11]+m[12]+m[13]+m[14];
  c[2][1] = m[12]+m[13]+m[14]+m[15]+m[22];
  c[2][2] = m[6]+m[7]+m[8]+m[9]+m[23];    
}

int main() {
  const int N = 1000000000;
  double A[3][3], C[3][3];
  std::clock_t t0, t1;

  A[0][0] = 3/5.; A[0][1] = 1/5.; A[0][2] = 2/5.;
  A[1][0] = 3/7.; A[1][1] = 1/7.; A[1][2] = 3/7.;
  A[2][0] = 1/3.; A[2][1] = 1/3.; A[2][2] = 1/3.;

  t0 = std::clock();
  for (int i = 0; i < N; i++) {
    // A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3, or the
    // loop is eliminated as dead code.
    simple_mul(A, A, C);
  }
  t1 = std::clock();
  double tdiff_simple = double(t1 - t0) / CLOCKS_PER_SEC; // seconds

  cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
  cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
  cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
  cout << tdiff_simple << endl;
  cout << endl;

  t0 = std::clock();
  for (int i = 0; i < N; i++) {
    // A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3
    laderman_mul(A, A, C);
  }
  t1 = std::clock();
  double tdiff_laderman = double(t1 - t0) / CLOCKS_PER_SEC; // seconds

  cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
  cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
  cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
  cout << tdiff_laderman << endl;
  cout << endl;

  double speedup = (tdiff_simple-tdiff_laderman)/tdiff_laderman;
  cout << "Approximate speedup: " << speedup << endl;

  return 0;
}
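
Since the Laderman coefficients are easy to mistype, a quick entrywise comparison against the naive routine is cheap insurance. A minimal sketch (my addition, not part of the benchmark above; check_agreement is a made-up helper):

#include <cmath>

// Returns true if both routines produce the same product, entry by
// entry, up to a small floating-point tolerance.
bool check_agreement(const double A[3][3]) {
  double C1[3][3], C2[3][3];
  simple_mul(A, A, C1);
  laderman_mul(A, A, C2);
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      if (std::fabs(C1[i][j] - C2[i][j]) > 1e-12)
        return false;
  return true;
}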
三岁会撩人 · 2019-02-05 22:14

Although the question mentioned C++, I implemented 3x3 matrix multiplication C = A*B in C# (.NET 4.5) and ran some basic timing tests on my 64-bit Windows 7 machine with optimizations enabled. 10,000,000 multiplications took about

  1. 0.556 seconds with the naive implementation and
  2. 0.874 seconds with the Laderman code from the other answer.

Interestingly, the Laderman code was slower than the naive way. I didn't investigate with a profiler, but my guess is that the extra allocations cost more than the few multiplications saved: in a direct C# port, the double[24] scratch array becomes a heap allocation on every call, whereas the C++ version keeps it on the stack.

It seems current compilers are smart enough to do those optimizations for us, which is good. Here's the naive code I used, for your interest:

    public static Matrix3D operator *(Matrix3D a, Matrix3D b)
    {
        double c11 = a.M11 * b.M11 + a.M12 * b.M21 + a.M13 * b.M31;
        double c12 = a.M11 * b.M12 + a.M12 * b.M22 + a.M13 * b.M32;
        double c13 = a.M11 * b.M13 + a.M12 * b.M23 + a.M13 * b.M33;
        double c21 = a.M21 * b.M11 + a.M22 * b.M21 + a.M23 * b.M31;
        double c22 = a.M21 * b.M12 + a.M22 * b.M22 + a.M23 * b.M32;
        double c23 = a.M21 * b.M13 + a.M22 * b.M23 + a.M23 * b.M33;
        double c31 = a.M31 * b.M11 + a.M32 * b.M21 + a.M33 * b.M31;
        double c32 = a.M31 * b.M12 + a.M32 * b.M22 + a.M33 * b.M32;
        double c33 = a.M31 * b.M13 + a.M32 * b.M23 + a.M33 * b.M33;
        return new Matrix3D(
            c11, c12, c13,
            c21, c22, c23,
            c31, c32, c33);
    }

where Matrix3D is an immutable struct (readonly double fields).

The tricky thing is coming up with a valid benchmark, where you measure your code and not what the compiler did with your code (a debug build full of extra checks, or an optimized build with the code under test removed entirely because its result was never used). I usually try to "touch" the result, so that the compiler cannot remove the code under test (e.g. check the matrix elements for equality with 89038.8989384 and throw if equal). However, in the end I'm not even sure whether the compiler optimizes this comparison out of the way :)
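
In C++, one common way to "touch" the result is a volatile sink; a minimal sketch (my illustration, not from this answer; the helper name touch is made up):

// The store to a volatile is an observable side effect, so the compiler
// cannot delete the computation feeding it as dead code.
static volatile double g_sink;

inline void touch(const double c[3][3]) {
  double s = 0.0;
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      s += c[i][j];
  g_sink = s; // call touch(C) inside the timing loop
}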

神经病院院长 · 2019-02-05 22:15

The key is mastering the instruction set of your platform. There are several techniques, and when you need the maximum possible performance, your compiler will come with profiling tools, some of which have optimization hinting built in. For the finest-grained tuning, look at the assembler output and see whether there are improvements to be made at that level as well.

Single instruction, multiple data (SIMD) instructions perform the same operation on several operands in parallel, so you can take

double a,b,c,d;
double w = d + a; 
double x = a + b;
double y = b + c;
double z = c + d;

and replace it with

double256 dabc = pack256(d, a, b, c);
double256 abcd = pack256(a, b, c, d);
double256 wxyz = dabc + abcd;

so that when the values are loaded into registers, they go into a single 256-bit-wide register (double256 and pack256 are fictional, for a platform with 256-bit-wide registers).
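
For a concrete counterpart on real hardware: x86 AVX provides 256-bit registers holding four doubles. A minimal sketch (my addition; needs an AVX-capable CPU and a flag such as g++ -mavx):

#include <immintrin.h> // x86 AVX intrinsics

// Performs the four additions from the example above with a single
// 256-bit vector instruction.
void add4(const double* a, const double* b, double* out) {
  __m256d va = _mm256_loadu_pd(a);              // load a[0..3]
  __m256d vb = _mm256_loadu_pd(b);              // load b[0..3]
  _mm256_storeu_pd(out, _mm256_add_pd(va, vb)); // out[i] = a[i] + b[i]
}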

Floating point is an important consideration; some DSPs are significantly faster operating on integers. GPUs tend to be great at floating point, although some are roughly twice as fast at single precision as at double. The 3x3 case of this problem fits in a single CUDA thread, so you could stream 256 of these calculations simultaneously.

Pick your platform, read the documentation, implement several different methods and profile them.
