cblas gemm time dependent on input matrix values

Posted 2020-06-28 00:56

This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!

Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?

I know gemv is designed for matrix-vector multiplication, but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices?
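Nothing in BLAS forbids this: with M = 1, the sgemm call used below computes

C(1 x m) = alpha * A(1 x m) * B(m x m) + beta * C(1 x m)

which for alpha = 1 and beta = 0 is exactly the row-vector-times-matrix product C = A*B.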

A short, representative program is given below. It asks the user to enter a value and then populates a vector with that value, replacing every 32nd element with its index. So entering '0' gives a sparse vector, while any other value gives a dense one.

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>

using namespace std;

int main()
{
    const int m = 5000;

    timespec blas_start, blas_end;
    long totalnsec; // total nanoseconds
    double totalsec, totaltime;
    int i, j;
    float *A = new float[m];   // 1 x m
    float *B = new float[m*m]; // m x m
    float *C = new float[m];   // 1 x m

    float input;
    cout << "Enter a value to populate the vector (0 for sparse) ";
    cin >> input; // enter 0 for sparse

    // input matrix A: every 32nd element is set to its index, rest of the values = input
    for(i = 0; i < m; i++)
    {
        A[i] = input;
        if(i % 32 == 0)    // adjust for sparsity
            A[i] = i;
    }

    // input matrix B: identity matrix
    for(i = 0; i < m; i++)
        for(j = 0; j < m; j++)
            B[i*m + j] = (i == j);

    clock_gettime(CLOCK_REALTIME, &blas_start);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
    clock_gettime(CLOCK_REALTIME, &blas_end);

    /* for(i = 0; i < m; i++)
           printf("%f ", C[i]);
       printf("\n\n"); */

    // Print time
    totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
    totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
    if(totalnsec < 0)
    {
        totalnsec += 1000000000L;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "Duration = " << totaltime << "\n";

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

I run it as follows on Ubuntu 14.04 with BLAS 3.0:

erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521

EDIT

If I replace the gemm call with the following gemv call, then matrix sparsity does not matter:

//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);

Results

erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282

But the issue is that I am trying to optimize someone else's code that uses CUBLAS, and he is successfully and efficiently using gemm to perform this vector-matrix multiplication. So I have to test against it, or categorically prove that call to be incorrect.
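For comparison, here is a minimal, untimed sketch of how that same 1 x m gemm might look in CUBLAS. This is my own illustration, not the actual code being optimized, and error checking is omitted. Note that cuBLAS is column-major, so the leading dimensions of the 1 x m operands are 1; the identity matrix B is symmetric, so its storage order does not matter here.

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int m = 5000;
    float *A = new float[m], *B = new float[m*m], *C = new float[m];

    // populate A and B the same way as in the CPU version (input = 0 case)
    for(int i = 0; i < m; i++)   A[i] = (i % 32 == 0) ? (float)i : 0.0f;
    for(int i = 0; i < m*m; i++) B[i] = 0.0f;
    for(int i = 0; i < m; i++)   B[i*m + i] = 1.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, m*sizeof(float));
    cudaMalloc((void**)&dB, (size_t)m*m*sizeof(float));
    cudaMalloc((void**)&dC, m*sizeof(float));
    cudaMemcpy(dA, A, m*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)m*m*sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C (1 x m) = A (1 x m) * B (m x m); column-major, so lda = ldc = 1
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m,
                &alpha, dA, 1, dB, m, &beta, dC, 1);
    cudaDeviceSynchronize();

    cudaMemcpy(C, dC, m*sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[64] = %f\n", C[64]); // B is the identity, so C should equal A

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] A; delete[] B; delete[] C;
    return 0;
}

Compile with something like nvcc gemmcomp.cu -lcublas. The point is only that a 1 x m gemm is a legitimate way to express this product; whether it is the fastest choice is a separate question.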

EDIT

I have even updated my BLAS library today by running

sudo apt-get install libblas-dev liblapack-dev

EDIT: I executed the following commands, as suggested by different contributors:

erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root   26 مارچ  13  2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ  13  2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root   28 مارچ  13  2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root   30 مارچ  13  2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root   32 مارچ  13  2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf

erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
    linux-gate.so.1 =>  (0xb76f6000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
    /lib/ld-linux.so.2 (0xb76f7000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)

1 Answer

放荡不羁爱自由 · 2020-06-28 01:28

Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?

It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu distributions) includes an optimization for cases where certain matrix elements are zero.

For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.

After unpacking that archive, we have a cblas/src directory that contains the cblas API, and we have another src directory that contains F77 implementations of various blas routines.

In the case of cblas_sgemm, when the parameter CblasRowMajor is specified, the cblas/src/cblas_sgemm.c code will call the underlying fortran routine as follows:

void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
             const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
             const int K, const float alpha, const float  *A,
             const int lda, const float  *B, const int ldb,
             const float beta, float  *C, const int ldc)
{
...
} else if (Order == CblasRowMajor)
...
  F77_sgemm(F77_TA, F77_TB, &F77_N, &F77_M, &F77_K, &alpha, B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);

Note that for this row-major call, the order of the A and B matrices is reversed when passed to the F77_sgemm routine. This is sensible, and the short reason is sketched below; for now it's sufficient to note that A has become B in the fortran call/code, and B has become A.
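In brief: a row-major array, reinterpreted as column-major, is the transpose of the original matrix. The wrapper therefore relies on the identity

C^T = (A*B)^T = B^T * A^T

By handing the column-major Fortran routine B first and A second (each implicitly transposed by the reinterpretation), it computes C^T directly, and C^T stored column-major is exactly the row-major C the caller expects, with no data movement at all.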

When we inspect the corresponding fortran routine in src/sgemm.f, we see the following sequence of code:

*
*     Start the operations.
*
      IF (NOTB) THEN
          IF (NOTA) THEN
*
*           Form  C := alpha*A*B + beta*C.
*
              DO 90 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 50 I = 1,M
                          C(I,J) = ZERO
   50                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 60 I = 1,M
                          C(I,J) = BETA*C(I,J)
   60                 CONTINUE
                  END IF
                  DO 80 L = 1,K
                      IF (B(L,J).NE.ZERO) THEN        ***OPTIMIZATION
                          TEMP = ALPHA*B(L,J)
                          DO 70 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
   70                     CONTINUE
                      END IF
   80             CONTINUE
   90         CONTINUE

The above is the section of code that handles the case where no transpose of A and no transpose of B is indicated (which is true for this cblas row-major test case). The row/column multiply operation is performed by the loops beginning at the line where I have added the note ***OPTIMIZATION. In particular, if the matrix element B(L,J) is zero, then the DO-loop ending at label 70 is skipped entirely. But remember that B here corresponds to the A matrix passed to the cblas_sgemm routine.

Skipping this DO-loop allows an sgemm implemented this way to be substantially faster when there are a large number of zeroes in the A matrix passed to cblas_sgemm with row-major storage.
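To make the effect concrete, here is a minimal C++ transcription of that kernel (my own sketch of the beta = 0, NoTrans/NoTrans path, not the library code):

// Illustration only: naive column-major sgemm (beta == 0 path) with the
// reference BLAS zero test. C is M x N, A is M x K, B is K x N.
void naive_sgemm(int M, int N, int K, float alpha,
                 const float *A, const float *B, float *C)
{
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i)
            C[i + j*M] = 0.0f;                 // beta == 0: clear C first
        for (int l = 0; l < K; ++l) {
            if (B[l + j*K] != 0.0f) {          // the ***OPTIMIZATION test
                float temp = alpha * B[l + j*K];
                for (int i = 0; i < M; ++i)    // skipped when B(l,j) == 0
                    C[i + j*M] += temp * A[i + l*M];
            }
        }
    }
}

In the question's call, after the operand swap the Fortran-side B is the user's 1 x m vector, so every zero in that vector skips an entire inner loop of m = 5000 multiply-adds.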

Experimentally, not all BLAS implementations have this optimization. Testing on the exact same platform, but with libopenblas-dev installed instead of libblas-dev, shows no such speed-up: essentially no difference in execution time between the case where the A matrix is mostly zeroes and the case where it is not.

Note that the fortran (77) code here appears to be similar or identical to older published versions of the sgemm.f routine, such as here. The newer published versions of this routine that I could find, such as here, do not contain this optimization.
