Very slow performance of cusparse csrsv_analysis

I wrote a Conjugate-gradient solver (for linear system of equations) with LU preconditioning, I used Dr. Maxim Naumov's papers on nvidia's research community as a guideline, the residuals update step, which requires solving a lower triangular matrix system and then solving an upper triangular matrix system is divided into two phases:

analysis phase (which exploits the sparsity pattern and decides the parallelization level).
the solution phase itself.

according to all posts related to this topic, and to Naumov's paper itself, the analysis phase is significantly slower than the solve phase, but it's executed once, so it's not supposed to be an issue when taking the whole execution time in consideration, however, in my program, the analysis phase requires ~35-45% of the whole solution time (time needed to perform all iterations!), which is quite annoying, another thing is that I'm quite sure that the matrix's sparsity pattern doesn't allow a big deal of parallelization, as nearly all elements require previous elements to be known (because in CFD each node will need at least the adjacent 6 nodes (hexahedral volumes) around it, and that's repeated on every row), so the analysis phase won't be very useful anyways !

matrixLU in this code contains both the upper and lower triangular matrices, the upper triangular matrix uses the original matrix diagonals, and the lower triangular matrix has unity diagonals (LU factorization).

// z = inv(matrixLU)*r
cusparseMatDescr_t descrL = 0 ;
cusparseMatDescr_t descrU = 0 ;
cusparseStatus = cusparseCreateMatDescr(&descrL) ;
cusparseStatus = cusparseCreateMatDescr(&descrU) ;

cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_GENERAL) ;
cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE) ;
cusparseSetMatDiagType(descrL,CUSPARSE_DIAG_TYPE_UNIT) ;
cusparseSetMatFillMode(descrL,CUSPARSE_FILL_MODE_LOWER) ;

cusparseSetMatType(descrU,CUSPARSE_MATRIX_TYPE_GENERAL) ;
cusparseSetMatIndexBase(descrU,CUSPARSE_INDEX_BASE_ONE) ;
cusparseSetMatDiagType(descrU,CUSPARSE_DIAG_TYPE_NON_UNIT) ;
cusparseSetMatFillMode(descrU,CUSPARSE_FILL_MODE_UPPER) ;

cusparseSolveAnalysisInfo_t inforL = 0 ;
cusparseSolveAnalysisInfo_t inforU = 0 ;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforL) ;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforU) ;

double startFA = omp_get_wtime() ;
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrL, matrixLU, iRow, jCol, inforL) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_analysis1 Error !") ;
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrU, matrixLU, iRow, jCol, inforU) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_analysis2 Error !") ;
double finishFA = omp_get_wtime() ;

So, anybody got a clue why the analysis phase is so slow ? and how it may be accelerated ? is the (analysis phase execution time)/(solve phase execution time) ratio GPU dependent ?

Edit: I tried this solver for many cases, and the results were close, but in the specific case I'm concerned about, it has the following conditions:

Size (N) : ~860,000 * 860,000
Number of non zeros (NZ): ~6,000,000
Number of iterations required for convergence: 10
Analysis phase execution time: 210 ms
Solve phase executions time (summing up all iterations): 350 ms
All floating point operations are performed in double precision format
GPU: GeForce GTX 550 Ti
OS: Windows 7 ultimate, 64 bit

I contacted our linear algebra libraries team at NVIDIA and they provided the following feedback.

With respect to the performance results:
1. The CG iterative method performs only 10 iterations until the solution of the linear system is found. It seems that in this case this is not enough iterations to ammortize the time consumed by the analysis phase. (Edit:) The number of steps required to converge to the solution varies across different applications. In some unpreconditioned iterative methods do not converge (or take thousands of iterations) and preconditioned iterative methods take tens (or hundreds) of iterations to converge to a solution. The fact that in this particular application the iterative method converges in 10 iterations is not necessarily representative of all applications.
2. The ratio of the analysis and solve phases is 6 = 210/35, which is in line with our experiments on other matrices, where it is usually in the interval [4,10]. It is not clear how the time for the analysis and solve phases compares to the time taken by the sparse triangular solve on the CPU. (This would be interesting information to know.)
With respect to the algorithm:
1. Unfortunately, there are only a few things you can do to recover a little bit of performance (there is no real way to speed up the analysis phase). For example, you can split the matrixLU into separate lower and upper triangular parts. Computing with separate triangular parts is in general faster than computing with triangular parts embedded in a general matrix. If you are using incomplete factorization with 0 fill-in as a preconditioner, you can also take advantage of the incomplete-LU/Cholesky with 0 fill-in on the GPU, that has recently become available in the CUSPARSE library. (Edit:) The incomplete-LU factorization with 0 fill-in is available in CUDA Toolkit 5.0 release. Currently the early access version of this release can be obtained by registered developers on the NVIDIA website.
2. We have never looked at how the ratio of analysis and solve phases varies across different GPUs; all our experiments done were on C2050.
3. The analysis phase is slow because it has to collect information on which rows can be processed together into levels. The solve phase is faster because it only processes the levels (See this paper).

Finally, you also mentioned that you think there is not enough parallelism in the problem. But the amount of parallelism can be hard to guess just from looking at the matrix. If indeed there is little parallelism (every row depends on the previous row), then the CUSPARSE sparse triangular solve might not be the right algorithm for this problem. Also, you can try to run the CG iterative method without preconditioning or with other types of preconditioners that might be more suitable to your problem.

As mentioned earlier, it would be interesting to know how the performance of this code compares on the GPU and CPU.

Edit: Regarding some comments... As mentioned earlier the ultimate number of iterations for a preconditioned iterative method varies across different applications. In some cases it can be very small (for example, 10 as it is in your application), but in others it can be relatively large (measured in 100s). The paper focuses not on the “triumph”, but rather the tradeoffs of the algorithm. It attempts to give the user a deeper understanding of the routine in order for it to be used effectively. Again, if the sparsity pattern at hand does not have any parallelism, this is not the right algorithm for you.

With respect to the CPU and GPU comparisons, there are certain problems (such as dense matrix-matrix or sparse matrix-vector multiplication) for which GPU is great, there are others (such as traversing a linked list or executing a completely sequential piece of code) for which it is not. The solution of sparse triangular linear systems lies somewhere in between those algorithms (depending on the sparsity pattern of the coefficient matrix and other properties of the linear system it might or not work well for you). Also, the decision to use the GPU is not based solely on a single algorithm, such as sparse triangular solve, but on the entire set of algorithms used by the application. Ultimately, GPUs are often very successful in accelerating and improving the performance of the overall application.

Finally, with respect to trsm vs. csrsv, it does not surprise us that for small matrices performing operations in dense storage is faster than in sparse storage (even though these small matrices might be sparse). This would usually be true not only for triangular solve, but for other linear algebra operations as well (it might happen at different cross points depending on the operation and the sparsity pattern of the matrix, though). Also, once again, the sparse triangular solve algorithm was designed to run in an iterative setting where the analysis is done once and solve done multiple times. So, it is not surprising that running it once (and counting in the analysis phase), results in a higher crossover point for this operation to be more effective in sparse rather than in dense format.