I am currently working on a project where we need to solve
|Ax - b|^2
.
In this case, A
is a very sparse matrix and A'A
has at most 5 nonzero elements in each row.
We are working with images and the dimension of A'A
is NxN
where N is the number of pixels. In this case N = 76800
. We plan to go to RGB
and then the dimension will be 3Nx3N
.
In matlab solving (A'A)\(A'b)
takes about 0.15 s, using doubles.
I have now done some experimenting with Eigens
sparse solvers. I have tried:
SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient
and some different orderings. The by far best so far is
SimplicialLDLT
which takes about 0.35 - 0.5
using AMDOrdering
.
When I for example use ConjugateGradient
it takes roughly 6 s
, using 0
as initilization.
The code for solving the problem is:
A_tot.makeCompressed();
// Create solver
Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
// Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
solver.analyzePattern(A_tot);
t1 = omp_get_wtime();
solver.compute(A_tot);
if (solver.info() != Eigen::Success)
{
std::cerr << "Decomposition Failed" << std::endl;
getchar();
}
Eigen::VectorXf opt = solver.solve(b_tot);
t2 = omp_get_wtime();
std::cout << "Time for normal equations: " << t2 - t1 << std::endl;
This is the first time I use sparse matrices in C++ and its solvers. For this project speed is crucial and below 0.1 s
is a minimum.
I would like to get some feedback on which would be the best strategy here. For example one is supposed to be able to use SuiteSparse
and OpenMP
in Eigen
. What are your experiences about these types of problems? Is there a way of extracting the structure for example? And should conjugateGradient
really be that slow?
Edit:
Thanks for som valuable comments! Tonight I have been reading a bit about cuSparse on Nvidia. It seems to be able to do factorisation an even solve systems. In particular it seems one can reuse pattern and so forth. The question is how fast could this be and what is the possible overhead?
Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be such an issue. I did some years ago software for real-time 3D reconstruction and then you copy data for each frame and a slow version still runs in over 50 Hz. So if the factorization is much faster it is a possible speed-up? In particualar the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution it is no drawback I guess.
A lot has happened in the field of Cuda and I am not really up to date.
Here are two links I found: Benchmark?, MatlabGPU
Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that
SimplicialLDLT
is the fastest here. Since the sparsity pattern is fixed, callanalyzePattern
once, and thenfactorize
instead ofcompute
. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step usingNaturalOrdering
(not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).