Best way of solving sparse linear systems in C++ -

I am currently working on a project where we need to solve

|Ax - b|^2.

In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.

We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.

In matlab solving (A'A)\(A'b) takes about 0.15 s, using doubles.

I have now done some experimenting with Eigens sparse solvers. I have tried:

SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient

and some different orderings. The by far best so far is

SimplicialLDLT

which takes about 0.35 - 0.5 using AMDOrdering.

When I for example use ConjugateGradient it takes roughly 6 s, using 0 as initilization.

The code for solving the problem is:

    A_tot.makeCompressed();
     // Create solver
    Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
    // Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
    solver.analyzePattern(A_tot);
    t1 = omp_get_wtime();
    solver.compute(A_tot);
    if (solver.info() != Eigen::Success)
     {
         std::cerr << "Decomposition Failed" << std::endl;
         getchar();
     }
    Eigen::VectorXf opt = solver.solve(b_tot);
    t2 = omp_get_wtime();
    std::cout << "Time for normal equations: " << t2 - t1 << std::endl;

This is the first time I use sparse matrices in C++ and its solvers. For this project speed is crucial and below 0.1 s is a minimum.

I would like to get some feedback on which would be the best strategy here. For example one is supposed to be able to use SuiteSparse and OpenMP in Eigen. What are your experiences about these types of problems? Is there a way of extracting the structure for example? And should conjugateGradient really be that slow?

Edit:

Thanks for som valuable comments! Tonight I have been reading a bit about cuSparse on Nvidia. It seems to be able to do factorisation an even solve systems. In particular it seems one can reuse pattern and so forth. The question is how fast could this be and what is the possible overhead?

Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be such an issue. I did some years ago software for real-time 3D reconstruction and then you copy data for each frame and a slow version still runs in over 50 Hz. So if the factorization is much faster it is a possible speed-up? In particualar the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution it is no drawback I guess.

A lot has happened in the field of Cuda and I am not really up to date.

Here are two links I found: Benchmark?, MatlabGPU

Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step using NaturalOrdering (not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).