This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
I know gemv is designed for matrix-vector multiplication, but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices?
A short, representative program is given below. It asks the user to enter a value and then populates a vector with that value. It then replaces every 32nd element with its index. So, if we enter '0' we get a sparse vector, but for other values we get a dense vector.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>

using namespace std;

int main()
{
    const int m = 5000;

    timespec blas_start, blas_end;
    long totalnsec;              // total nanoseconds
    double totalsec, totaltime;
    int i, j;

    float *A = new float[m];     // 1 x m
    float *B = new float[m*m];   // m x m
    float *C = new float[m];     // 1 x m
    float input;

    cout << "Enter a value to populate the vector (0 for sparse) ";
    cin >> input;                // enter 0 for sparse

    // input matrix A: every 32nd element is non-zero, rest of the values = input
    for(i = 0; i < m; i++)
    {
        A[i] = input;
        if(i % 32 == 0)          // adjust for sparsity
            A[i] = i;
    }

    // input matrix B: identity matrix
    for(i = 0; i < m; i++)
        for(j = 0; j < m; j++)
            B[i*m + j] = (i == j);

    clock_gettime(CLOCK_REALTIME, &blas_start);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
    clock_gettime(CLOCK_REALTIME, &blas_end);

    /* for(i = 0; i < m; i++)
        printf("%f ", C[i]);
    printf("\n\n"); */

    // Print time
    totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
    totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
    if(totalnsec < 0)
    {
        totalnsec += 1000000000L;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "Duration = " << totaltime << "\n";

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
I run it as follows on Ubuntu 14.04 with BLAS 3.0:
erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521
EDIT
If I replace the gemm call with the following gemv call, then matrix sparsity does not matter:
//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
Results
erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282
But the issue is that I am trying to optimize someone else's code that uses cuBLAS, and he is successfully and efficiently using gemm to perform this vector-matrix multiplication. So I have to test against it, or categorically prove this call to be incorrect.
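For context, a vector-matrix product maps onto gemm as a product with a single row. Below is a minimal sketch of how such a cuBLAS call can look; the helper name vecmat_sgemm and the device pointers are mine for illustration, not the other author's actual code:
#include <cublas_v2.h>

// Sketch: y = x * B, with x (1 x m), B (m x m), y (1 x m), all already
// resident on the device. Illustrative only; error checking omitted.
void vecmat_sgemm(cublasHandle_t handle, int m,
                  const float *d_x, const float *d_B, float *d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major, so express the product as a 1 x m times
    // m x m gemm: M = 1, N = m, K = m. For a non-symmetric B stored
    // row-major you would pass CUBLAS_OP_T for the second operand.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                1, m, m,
                &alpha,
                d_x, 1,   // "A": the vector, leading dimension 1
                d_B, m,   // "B": the matrix
                &beta,
                d_y, 1);  // "C": the result vector
}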
EDIT
I have even updated my BLAS library today using:
sudo apt-get install libblas-dev liblapack-dev
EDIT: I executed the following commands as suggested by different contributors:
erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 Mar 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 Mar 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 Mar 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 Mar 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 Mar 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 Mar 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 Mar 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 Mar 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 Mar 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
ANSWER
It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu releases) includes an optimization for cases where certain matrix elements are zero.
For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.
After unpacking that archive, we have a cblas/src directory that contains the cblas API, and another src directory that contains F77 implementations of the various BLAS routines.
In the case of cblas_sgemm, when the parameter CblasRowMajor is specified, the cblas/src/cblas_sgemm.c code calls the underlying Fortran routine as follows:
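(Sketch from the reference CBLAS source; the F77_* identifiers are wrapper macros whose exact spellings may vary between versions.)
   /* Row-major case: swap the transpose flags, the A/B operands, and
      the M/N dimensions, so that the column-major Fortran routine
      computes the transposed product. */
   F77_sgemm(F77_TB, F77_TA, &F77_N, &F77_M, &F77_K, &alpha,
             B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);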
Note that for this row-major call, the order of the A and B matrices is reversed when passed to the F77_sgemm routine. This is sensible, but I won't delve into why here. It's sufficient to note that A has become B in the Fortran call/code, and B has become A.
When we inspect the corresponding Fortran routine in src/sgemm.f, we see the following sequence of code:
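(Reconstructed from the reference F77 BLAS of that vintage; statement labels and comments may differ slightly from the packaged copy.)
*           Form  C := alpha*A*B + beta*C.
*
            DO 90 J = 1,N
                IF (BETA.EQ.ZERO) THEN
                    DO 50 I = 1,M
                        C(I,J) = ZERO
   50               CONTINUE
                ELSE IF (BETA.NE.ONE) THEN
                    DO 60 I = 1,M
                        C(I,J) = BETA*C(I,J)
   60               CONTINUE
                END IF
                DO 80 L = 1,K
*                   ***OPTIMIZATION: the update loop below is skipped
*                   entirely whenever B(L,J) is zero
                    IF (B(L,J).NE.ZERO) THEN
                        TEMP = ALPHA*B(L,J)
                        DO 70 I = 1,M
                            C(I,J) = C(I,J) + TEMP*A(I,L)
   70                   CONTINUE
                    END IF
   80           CONTINUE
   90       CONTINUE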
The above is the section of code that handles the case where no transpose of A and no transpose of B are indicated (which is true for this cblas row-major test case). The matrix row/column multiply operation is handled by the loops beginning where I have added the ***OPTIMIZATION note. In particular, if the matrix element B(L,J) is zero, then the DO-loop closing at label 70 is skipped. But remember that B here corresponds to the A matrix passed to the cblas_sgemm routine.
The skipping of this DO-loop allows an sgemm implemented this way to be substantially faster in the cases where there are a large number of zeroes in the A matrix passed to cblas_sgemm when row-major is specified.
Experimentally, not all BLAS implementations have this optimization. Testing on the exact same platform but using libopenblas-dev instead of libblas-dev provides no such speed-up, i.e. essentially no execution-time difference between the case when the A matrix is mostly zeroes and the case when it is not.
Note that the Fortran (77) code here appears to be similar or identical to older published versions of the sgemm.f routine, such as here. Newer published versions of this Fortran routine that I could find do not contain this optimization, such as here.
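Incidentally, the ll output above shows that the libblas.so.3 symlink is managed through /etc/alternatives, so when several BLAS providers (e.g. the reference libblas and OpenBLAS) are installed, the one loaded at run time can be switched for a comparison like this one with:
sudo update-alternatives --config libblas.so.3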